Data Transformation & Scaling: Complete Beginner Guide for Data Science

Before building any Machine Learning model, your data needs to be clean, consistent, and in the right format.
This is where Data Transformation & Scaling comes in.

In simple words:

Data Transformation = Changing the shape or format of data
Data Scaling = Putting all numbers on a similar range

Without these steps, many ML models may produce incorrect or unstable results.

What Is Data Transformation?

Data Transformation means changing data into a useful format for analysis or machine learning.

You use transformation when:

  • Data contains categories

  • Data has skewed values

  • Data contains text

  • Data needs normalization

  • Data needs encoding

Common Transformation Techniques:

  1. Encoding Categorical Data

  2. Log Transformation

  3. Box-Cox Transformation

  4. Binning (Discretization)

  5. Date/Time Transformation

  6. Feature Construction

Let’s understand each clearly.

Encoding Categorical Variables

Most machine learning models cannot understand words, only numbers.

So we convert:

  • “Male”, “Female” → 0, 1

  • “Red”, “Blue”, “Green” → numeric labels

One-Hot Encoding

Creates separate columns:

Color    Red    Blue    Green
Red      1      0       0
pd.get_dummies(df["Color"])
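The snippet above assumes a DataFrame `df` with a "Color" column; a self-contained sketch with made-up values:

```python
import pandas as pd

# Toy data with a single categorical column (values illustrative)
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One new column per category, with a 1 marking each row's color
dummies = pd.get_dummies(df["Color"])
print(dummies)
```

Each row has exactly one 1 (shown as True in recent pandas versions) across the three new columns.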

Label Encoding

Converts categories into numbers:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])

Log Transformation

Log transforms help remove skewness (when values are highly uneven).

Example:
Income data → usually skewed.

df["Income_log"] = np.log(df["Income"] + 1)

Adding 1 avoids a log(0) error when the data contains zeros.
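A quick way to confirm the transform helped is to compare skewness before and after. A sketch with simulated income data (log-normal, so it is right-skewed by construction):

```python
import numpy as np
import pandas as pd

# Simulated skewed "income" data (seeded for reproducibility)
rng = np.random.default_rng(42)
df = pd.DataFrame({"Income": rng.lognormal(mean=10, sigma=1, size=1000)})

df["Income_log"] = np.log(df["Income"] + 1)

print(df["Income"].skew())      # large positive skew
print(df["Income_log"].skew())  # close to 0
```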

Box-Cox Transformation

Box-Cox also reduces skewness but works only on strictly positive values.

from scipy.stats import boxcox
# boxcox raises an error if any value is zero or negative
df["Transformed"], lam = boxcox(df["Column"])

Binning (Discretization)

Converting continuous values into categories.

Example: Age

  • 0–18 → Child

  • 18–60 → Adult

  • 60+ → Senior

bins = [0, 18, 60, 100]
labels = ["Child", "Adult", "Senior"]
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels)

Date & Time Transformation

Extracting features like year, month, day, and hour from dates can improve ML models.

df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

What Is Data Scaling?

Scaling makes sure all numbers have similar ranges.

Why?
Many ML models rely on distance calculations (like KNN and SVM).
Features with large values then dominate features with small values and distort the results.

Example:

Feature    Values
Age        20–60
Salary     30,000–150,000

Salary will overpower Age unless scaled.
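You can see this directly with a Euclidean distance between two made-up people, using the ranges from the table above:

```python
import numpy as np

# Two people as (Age, Salary) vectors
a = np.array([25.0, 30000.0])
b = np.array([60.0, 31000.0])

# Unscaled: a 1,000 salary gap swamps a 35-year age gap
dist_raw = np.linalg.norm(a - b)
print(dist_raw)  # ~1000.6, almost all of it from Salary

# Min-max scale by hand with Age in [20, 60] and Salary in [30000, 150000]
a_s = np.array([(a[0] - 20) / 40, (a[1] - 30000) / 120000])
b_s = np.array([(b[0] - 20) / 40, (b[1] - 30000) / 120000])

dist_scaled = np.linalg.norm(a_s - b_s)
print(dist_scaled)  # ~0.875, now driven mostly by Age
```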

Common Scaling Techniques

  1. Min-Max Scaling

  2. Standardization (Z-Score Scaling)

  3. Robust Scaling

Let’s look at each one.

Min-Max Scaling (0 to 1 Scaling)

This scaling converts values into a range between 0 and 1.

Formula:

X_scaled = (X - Xmin) / (Xmax - Xmin)

Example:

Age = 40
(min = 20, max = 60)

Scaled = (40−20) / (60−20) = 0.5

Python Code:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])

When to use?

  • Neural networks

  • Distance-based models

  • Image data

Standardization (Z-Score Scaling)

Values are scaled to have:

  • Mean = 0

  • Standard deviation = 1

Formula:

Z = (X - mean) / std
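Plugging small made-up ages into the formula:

```python
import numpy as np

ages = np.array([20, 30, 40, 50, 60])

# Population std, which is what scikit-learn's StandardScaler also uses
z = (ages - ages.mean()) / ages.std()

print(z)         # symmetric around 0
print(z.mean())  # ≈ 0
print(z.std())   # ≈ 1
```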

Python Code:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])

When to use?

  • Linear regression

  • Logistic regression

  • SVM

  • PCA (Principal Component Analysis)

Robust Scaling (Works Well with Outliers)

Uses the median and IQR (interquartile range) instead of the mean and standard deviation, so outliers have little effect.

Formula:

X_scaled = (X - median) / IQR

Python Code:

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])

When to use?

  • Data with heavy outliers like income

  • Skewed data
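A small sketch of why this matters, comparing StandardScaler and RobustScaler on incomes with one extreme outlier (values made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four typical incomes plus one extreme outlier
X = np.array([[30000.0], [35000.0], [40000.0], [45000.0], [1000000.0]])

std_scaled = StandardScaler().fit_transform(X)
rob_scaled = RobustScaler().fit_transform(X)

# The outlier inflates the mean and std, squashing the typical values together
print(std_scaled.ravel())

# Median (40000) and IQR (10000) ignore the outlier, so typical values stay spread out
print(rob_scaled.ravel())
```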

Real-World Example

A company has a dataset:

Age    Salary
25     30,000
30     60,000
50     150,000

If you train a model without scaling:
Salary dominates Age.
The model becomes biased toward it.

After scaling:
Both features carry similar importance → ML model performs better.

End-to-End Python Example

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "Age": [25, 30, 50, 45],
    "Salary": [30000, 60000, 150000, 120000]
})

# Min-Max Scaling
scaler = MinMaxScaler()
df_minmax = scaler.fit_transform(df)

# Standard Scaling
scaler2 = StandardScaler()
df_standard = scaler2.fit_transform(df)

print(df_minmax)
print(df_standard)