Data Transformation & Scaling: Complete Beginner Guide for Data Science
Before building any Machine Learning model, your data needs to be clean, consistent, and in the right format.
This is where Data Transformation & Scaling comes in.
In simple words:
Data Transformation = Changing the shape or format of data
Data Scaling = Putting all numbers on a similar range
Without these steps, many ML models may produce incorrect or unstable results.
What Is Data Transformation?
Data Transformation means changing data into a useful format for analysis or machine learning.
You use transformation when:
Data contains categories
Data has skewed values
Data contains text
Data needs normalization
Data needs encoding
Common Transformation Techniques:
Encoding Categorical Data
Log Transformation
Box-Cox Transformation
Binning (Discretization)
Date/Time Transformation
Feature Construction
Let’s understand each clearly.
Encoding Categorical Variables
Most machine learning algorithms operate on numbers, so text categories have to be converted into numeric values first.
So we convert:
“Male”, “Female” → 0, 1
“Red”, “Blue”, “Green” → numeric labels
One-Hot Encoding
Creates a separate 0/1 column for each category:
| Color | Red | Blue | Green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
pd.get_dummies(df["Color"])
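As a minimal sketch of the one-hot step above, the snippet below encodes a small, made-up Color column and joins the indicator columns back onto the original frame. The sample rows and the names one_hot and df_encoded are assumptions for illustration; dtype=int is passed so the output shows 0/1 rather than True/False.

```python
import pandas as pd

# Hypothetical sample data; only the column name "Color" comes from the article.
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
one_hot = pd.get_dummies(df["Color"], dtype=int)

# Attach the indicator columns to the original data.
df_encoded = pd.concat([df, one_hot], axis=1)
print(df_encoded)
```

Each row has exactly one 1 across the indicator columns, and the new columns appear in sorted category order.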
Label Encoding
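Converts each category into an integer code. The codes imply an order, so this suits ordinal categories or tree-based models better than nominal ones. Note that scikit-learn documents LabelEncoder for target labels; OrdinalEncoder is its counterpart for input features.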
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])
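To see which number each category receives, the sketch below inspects the fitted encoder. The sample rows and the column name Gender_code are assumptions for illustration. LabelEncoder assigns codes in sorted order of the labels, so "Female" gets 0 and "Male" gets 1 here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data using the article's "Gender" column.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

le = LabelEncoder()
df["Gender_code"] = le.fit_transform(df["Gender"])

# classes_ lists the categories in the order of their integer codes.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = le.inverse_transform(df["Gender_code"])
print(list(restored))
```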
Log Transformation
Log transforms reduce right skew, where a few very large values stretch the distribution far beyond the typical range.
Example:
Income data → usually skewed.
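df["Income_log"] = np.log1p(df["Income"])  # same as np.log(df["Income"] + 1)
Adding 1 keeps zero values valid, because log(0) is undefined (NumPy returns -inf with a warning).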
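The effect on skewness can be checked directly. The following is a rough sketch on made-up income figures (the values are an assumption chosen to be strongly right-skewed); it compares pandas' skew() before and after the transform and shows that np.expm1 undoes it.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes (each value is 1.5x the previous one).
df = pd.DataFrame(
    {"Income": [12000, 18000, 27000, 40500, 60750, 91125, 136688, 205031, 307547]}
)

# log1p(x) computes log(1 + x), so a zero income would map to 0 instead of -inf.
df["Income_log"] = np.log1p(df["Income"])

skew_before = df["Income"].skew()
skew_after = df["Income_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# expm1 is the inverse of log1p, useful for reporting results in original units.
restored = np.expm1(df["Income_log"])
```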
Box-Cox Transformation
Box-Cox also reduces skewness, but it works only on strictly positive values (no zeros or negatives).
from scipy.stats import boxcox
df["Transformed"], lam = boxcox(df["Column"])
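A slightly fuller sketch of the same call is shown below, with a guard for non-positive values and the inverse transform. The sample values are made up for illustration; boxcox estimates the parameter lam from the data, and scipy.special.inv_boxcox maps transformed values back.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Hypothetical right-skewed, strictly positive data.
df = pd.DataFrame({"Column": [1.0, 2.0, 2.5, 3.0, 4.0, 8.0, 20.0, 55.0]})

# Box-Cox is undefined for zero or negative inputs, so check first.
if (df["Column"] <= 0).any():
    raise ValueError("Box-Cox requires strictly positive values")

# With no lambda given, boxcox returns the transformed data and the fitted lambda.
transformed, lam = boxcox(df["Column"].to_numpy())
df["Transformed"] = transformed
print(f"fitted lambda: {lam:.3f}")

# Keep lam around: it is needed to map values back to the original scale.
restored = inv_boxcox(transformed, lam)
```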
Binning (Discretization)
Converting continuous values into categories.
Example: Age
0–18 → Child
19–60 → Adult
61+ → Senior
bins = [0, 18, 60, np.inf]  # intervals: [0, 18], (18, 60], (60, inf)
labels = ["Child", "Adult", "Senior"]
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels, include_lowest=True)
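Boundary values are where binning most often goes wrong, so here is a small sketch that checks them. It assumes pandas' default right-closed intervals, include_lowest=True so that age 0 is kept, and an open-ended top bin via np.inf; the sample ages are made up. For comparison it also bins with a closed upper edge of 100, where ages 0 and 105 fall outside every bin and become NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges.
df = pd.DataFrame({"Age": [0, 18, 19, 60, 61, 105]})
labels = ["Child", "Adult", "Senior"]

# Open-ended top bin, and include_lowest=True so that 0 lands in "Child".
df["Age_Group"] = pd.cut(
    df["Age"], bins=[0, 18, 60, np.inf], labels=labels, include_lowest=True
)

# Closed top edge and no include_lowest: 0 and 105 get no bin (NaN).
strict = pd.cut(df["Age"], bins=[0, 18, 60, 100], labels=labels)

print(df)
print(strict.isna().tolist())
```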
Date & Time Transformation
Raw timestamps are hard for a model to use directly. Extracting parts such as year, month, day, and hour can expose seasonal or time-of-day patterns.
df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
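The same .dt accessor exposes the other parts mentioned above. This sketch uses two made-up timestamps; the column names Day, Hour, DayOfWeek, and IsWeekend are assumptions for illustration.

```python
import pandas as pd

# Hypothetical timestamps: a Monday morning and a Saturday evening.
df = pd.DataFrame({"Date": ["2024-01-15 08:30:00", "2024-07-06 22:05:00"]})
df["Date"] = pd.to_datetime(df["Date"])

df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Hour"] = df["Date"].dt.hour
df["DayOfWeek"] = df["Date"].dt.dayofweek  # Monday = 0, Sunday = 6
df["IsWeekend"] = df["DayOfWeek"] >= 5     # Saturday or Sunday

print(df)
```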
What Is Data Scaling?
Scaling makes sure all numbers have similar ranges.
Why?
Many ML models rely on distance calculations (for example KNN and SVM).
Features with large values dominate features with small values in those calculations, which distorts the model's results.
Example:
| Feature | Values |
|---|---|
| Age | 20–60 |
| Salary | 30,000–150,000 |
Salary will overpower Age unless scaled.
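A quick way to see this is to compute Euclidean distances by hand. In the sketch below, the three people are made up; the feature ranges come from the table above, and the rescaling divides each feature by its range, which is one simple way to put both on a 0 to 1 scale.

```python
import numpy as np

# Each point is [Age, Salary]; the values are hypothetical.
a = np.array([20.0, 50000.0])
b = np.array([60.0, 50000.0])  # same salary as a, but 40 years older
c = np.array([20.0, 51000.0])  # same age as a, salary only 1,000 higher

# Raw distances: the 1,000 salary gap outweighs the 40-year age gap.
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print(dist_ab, dist_ac)

# Rescale each feature to 0..1 using the ranges from the table.
mins = np.array([20.0, 30000.0])
maxs = np.array([60.0, 150000.0])

def rescale(x):
    return (x - mins) / (maxs - mins)

# After rescaling, the full-range age gap is the larger distance.
scaled_ab = np.linalg.norm(rescale(a) - rescale(b))
scaled_ac = np.linalg.norm(rescale(a) - rescale(c))
print(scaled_ab, scaled_ac)
```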
Common Scaling Techniques
Min-Max Scaling
Standardization (Z-Score Scaling)
Robust Scaling
Let’s go through each one.
Min-Max Scaling (0 to 1 Scaling)
This scaling converts values into a range between 0 and 1.
Formula:
X_scaled = (X - Xmin) / (Xmax - Xmin)
Example:
Age = 40
(min = 20, max = 60)
Scaled = (40−20) / (60−20) = 0.5
Python Code:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])
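The worked example (Age = 40 with min 20 and max 60) can be checked against scikit-learn. This is a small sketch with made-up rows chosen so that the Age column has exactly that min and max; the name manual is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: Age spans 20..60, Salary spans 30,000..150,000.
df = pd.DataFrame({"Age": [20, 40, 60], "Salary": [30000, 90000, 150000]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["Age", "Salary"]])  # returns a NumPy array

# The formula applied by hand, column by column.
manual = (df - df.min()) / (df.max() - df.min())

print(scaled)
print(manual)
```

Both approaches give 0.5 for Age = 40, matching the hand calculation.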
When to use?
Neural networks
Distance-based models
Image data
Standardization (Z-Score Scaling)
Values are scaled to have:
Mean = 0
Standard deviation = 1
Formula:
Z = (X - mean) / std
Python Code:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])
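To connect the formula to the scaler, the sketch below applies Z = (X - mean) / std by hand and compares the result. One detail worth knowing: StandardScaler uses the population standard deviation (ddof=0), whereas pandas' std() defaults to the sample version (ddof=1), so ddof=0 is passed explicitly. The sample rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data.
df = pd.DataFrame({
    "Age": [25, 30, 50, 45],
    "Salary": [30000, 60000, 150000, 120000],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["Age", "Salary"]])

# Same formula by hand; ddof=0 matches StandardScaler's population std.
manual = (df - df.mean()) / df.std(ddof=0)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```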
When to use?
Linear regression
Logistic regression
SVM
PCA (Principal Component Analysis)
Robust Scaling (Works Well with Outliers)
Uses median and IQR instead of mean and std.
Python Code:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_scaled = scaler.fit_transform(df[["Age", "Salary"]])
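With its default settings, RobustScaler subtracts the median and divides by the IQR (75th minus 25th percentile). The sketch below reproduces that by hand on made-up data with one extreme salary. The typical values stay in a narrow band around 0 and only the outlier ends up far away.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical data with one extreme salary.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [30000, 40000, 50000, 60000, 1000000],
})

scaler = RobustScaler()
scaled = scaler.fit_transform(df[["Age", "Salary"]])

# Same calculation by hand: (X - median) / (Q3 - Q1).
iqr = df.quantile(0.75) - df.quantile(0.25)
manual = (df - df.median()) / iqr

print(scaled)
```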
When to use?
Data with heavy outliers like income
Skewed data
Real-World Example
A company has a dataset:
| Age | Salary |
|---|---|
| 25 | 30,000 |
| 30 | 60,000 |
| 50 | 150,000 |
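If you train a distance-based or gradient-based model without scaling:
Salary dominates Age because its values are thousands of times larger.
The model ends up largely ignoring Age.
After scaling:
Both features contribute on a comparable scale, which usually helps these models perform better.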
End-to-End Python Example
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.DataFrame({
"Age": [25, 30, 50, 45],
"Salary": [30000, 60000, 150000, 120000]
})
# Min-Max Scaling
scaler = MinMaxScaler()
df_minmax = scaler.fit_transform(df)
# Standard Scaling
scaler2 = StandardScaler()
df_standard = scaler2.fit_transform(df)
print(df_minmax)
print(df_standard)
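fit_transform returns a plain NumPy array, so the column names are lost. As a possible follow-up, this sketch wraps the result back into a DataFrame, reuses the already fitted scaler on a new row with transform (so the new row is scaled with the same min and max), and converts scaled values back with inverse_transform. The new row is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "Age": [25, 30, 50, 45],
    "Salary": [30000, 60000, 150000, 120000],
})

scaler = MinMaxScaler()

# Keep the column names and index by wrapping the array in a DataFrame.
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

# Hypothetical new row: transform (not fit_transform) reuses the fitted min/max.
new_rows = pd.DataFrame({"Age": [40], "Salary": [90000]})
new_scaled = scaler.transform(new_rows)
print(new_scaled)

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(df_minmax)
```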