Data Transformation & Scaling: Complete Beginner Guide for Data Science

Before building any Machine Learning model, your data needs to be clean, consistent, and in the right format.
This is where Data Transformation & Scaling comes in.

In simple words:

Data Transformation = Changing the shape or format of data
Data Scaling = Putting all numbers on a similar range

Without these steps, many ML models may produce incorrect or unstable results.

What Is Data Transformation?

Data Transformation means changing data into a useful format for analysis or machine learning.

You use transformation when:

  • Data contains categories

  • Data has skewed values

  • Data contains text

  • Data needs normalization

  • Data needs encoding

Common Transformation Techniques:

  1. Encoding Categorical Data

  2. Log Transformation

  3. Box-Cox Transformation

  4. Binning (Discretization)

  5. Date/Time Transformation

  6. Feature Construction

Let’s understand each clearly.

Encoding Categorical Variables

Most machine learning models cannot understand words, only numbers.

So we convert:

  • “Male”, “Female” → 0, 1

  • “Red”, “Blue”, “Green” → numeric labels

One-Hot Encoding

Creates separate columns:

Color    Red    Blue    Green
Red      1      0       0
pd.get_dummies(df["Color"])
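The snippet above assumes a DataFrame `df` with a "Color" column; a self-contained sketch with made-up values:

```python
import pandas as pd

# Toy data with a single categorical column (values illustrative)
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One new column per category, with a 1 marking each row's color
dummies = pd.get_dummies(df["Color"])
print(dummies)
```

Each row has exactly one 1 (shown as True in recent pandas versions) across the three new columns.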

Label Encoding

Converts categories into numbers:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])

Log Transformation

Log transforms help remove skewness (when values are highly uneven).

Example:
Income data → usually skewed.

df["Income_log"] = np.log(df["Income"] + 1)

Adding 1 avoids a log(0) error when the data contains zeros.
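A quick way to confirm the transform helped is to compare skewness before and after. A sketch with simulated income data (log-normal, so it is right-skewed by construction):

```python
import numpy as np
import pandas as pd

# Simulated skewed "income" data (seeded for reproducibility)
rng = np.random.default_rng(42)
df = pd.DataFrame({"Income": rng.lognormal(mean=10, sigma=1, size=1000)})

df["Income_log"] = np.log(df["Income"] + 1)

print(df["Income"].skew())      # large positive skew
print(df["Income_log"].skew())  # close to 0
```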

Box-Cox Transformation

Box-Cox also reduces skewness but works only on strictly positive values.

from scipy.stats import boxcox
# boxcox raises an error if any value is zero or negative
df["Transformed"], lam = boxcox(df["Column"])

Binning (Discretization)

Converting continuous values into categories.

Example: Age

  • 0–18 → Child

  • 18–60 → Adult

  • 60+ → Senior

bins = [0, 18, 60, 100]
labels = ["Child", "Adult", "Senior"]
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels)

Date & Time Transformation

Extracting features like year, month, day, and hour from dates can improve ML models.

df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

What Is Data Scaling?

Scaling makes sure all numbers have similar ranges.

Why?
Many ML models rely on distance calculations (like KNN and SVM).
Features with large values then dominate features with small values and distort the results.

Example:

Feature    Values
Age        20–60
Salary     30,000–150,000

Salary will overpower Age unless scaled.
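You can see this directly with a Euclidean distance between two made-up people, using the ranges from the table above:

```python
import numpy as np

# Two people as (Age, Salary) vectors
a = np.array([25.0, 30000.0])
b = np.array([60.0, 31000.0])

# Unscaled: a 1,000 salary gap swamps a 35-year age gap
dist_raw = np.linalg.norm(a - b)
print(dist_raw)  # ~1000.6, almost all of it from Salary

# Min-max scale by hand with Age in [20, 60] and Salary in [30000, 150000]
a_s = np.array([(a[0] - 20) / 40, (a[1] - 30000) / 120000])
b_s = np.array([(b[0] - 20) / 40, (b[1] - 30000) / 120000])

dist_scaled = np.linalg.norm(a_s - b_s)
print(dist_scaled)  # ~0.875, now driven mostly by Age
```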

Common Scaling Techniques

  1. Min-Max Scaling

  2. Standardization (Z-Score Scaling)

  3. Robust Scaling

Let’s look at each one.

Min-Max Scaling (0 to 1 Scaling)

This scaling converts values into a range between 0 and 1.

Formula:

X_scaled = (X - Xmin) / (Xmax - Xmin)

Example:

Age = 40
(min = 20, max = 60)

Scaled = (40−20) / (60−20) = 0.5

Python Code:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])

When to use?

  • Neural networks

  • Distance-based models

  • Image data

Standardization (Z-Score Scaling)

Values are scaled to have:

  • Mean = 0

  • Standard deviation = 1

Formula:

Z = (X - mean) / std
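Plugging small made-up ages into the formula:

```python
import numpy as np

ages = np.array([20, 30, 40, 50, 60])

# Population std, which is what scikit-learn's StandardScaler also uses
z = (ages - ages.mean()) / ages.std()

print(z)         # symmetric around 0
print(z.mean())  # ≈ 0
print(z.std())   # ≈ 1
```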

Python Code:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])

When to use?

  • Linear regression

  • Logistic regression

  • SVM

  • PCA (Principal Component Analysis)

Robust Scaling (Works Well with Outliers)

Uses the median and IQR (interquartile range) instead of the mean and standard deviation, so outliers have little effect.

Formula:

X_scaled = (X - median) / IQR

Python Code:

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])

When to use?

  • Data with heavy outliers like income

  • Skewed data
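A small sketch of why this matters, comparing StandardScaler and RobustScaler on incomes with one extreme outlier (values made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four typical incomes plus one extreme outlier
X = np.array([[30000.0], [35000.0], [40000.0], [45000.0], [1000000.0]])

std_scaled = StandardScaler().fit_transform(X)
rob_scaled = RobustScaler().fit_transform(X)

# The outlier inflates the mean and std, squashing the typical values together
print(std_scaled.ravel())

# Median (40000) and IQR (10000) ignore the outlier, so typical values stay spread out
print(rob_scaled.ravel())
```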

Real-World Example

A company has a dataset:

Age    Salary
25     30,000
30     60,000
50     150,000

If you train a model without scaling:
Salary dominates Age.
The model becomes biased toward it.

After scaling:
Both features carry similar importance → ML model performs better.

End-to-End Python Example

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "Age": [25, 30, 50, 45],
    "Salary": [30000, 60000, 150000, 120000]
})

# Min-Max Scaling
scaler = MinMaxScaler()
df_minmax = scaler.fit_transform(df)

# Standard Scaling
scaler2 = StandardScaler()
df_standard = scaler2.fit_transform(df)

print(df_minmax)
print(df_standard)