Encoding Categorical Variables: Simple Guide with Examples for Beginners
Most machine learning models work with numbers, not text.
So whenever your dataset contains categories like:
Gender: Male, Female
City: Delhi, Mumbai, Chennai
Color: Red, Blue, Green
Education: Graduate, Post-Graduate
…you must convert them into numeric values.
This process is called Encoding Categorical Variables.
What Are Categorical Variables?
Categorical variables contain text labels instead of numbers.
Types of Categorical Data
Nominal (No order)
Examples: Gender, City, Color
Ordinal (Has order)
Examples: Low < Medium < High
Education: High School < Graduate < Post-Graduate
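To make the nominal vs. ordinal difference concrete, here is a small sketch using pandas `Categorical`. The variable names and sample values are made up for illustration.

```python
import pandas as pd

# Nominal: the categories have no meaningful order.
city = pd.Categorical(["Delhi", "Mumbai", "Chennai"], ordered=False)

# Ordinal: the order is declared explicitly, from lowest to highest.
level = pd.Categorical(
    ["Medium", "Low", "High"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

print(city.ordered)   # False
print(level.ordered)  # True
print(level.codes)    # position of each value in the declared order: [1 0 2]
print(level.min(), level.max())  # Low High
```

Comparisons such as min and max only make sense for the ordered version, which is why the two types need different encodings.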
Why Do We Need Encoding?
Most ML models cannot process text labels directly.
They only work with numerical representations.
Encoding helps:
Convert text to numbers
Make data usable for ML algorithms
Improve accuracy when the encoding suits the data
Avoid errors during model training
Types of Encoding Techniques
We will cover:
Label Encoding
One-Hot Encoding
Ordinal Encoding
Target Encoding (advanced)
Binary Encoding (advanced)
Let’s explain each in very simple terms.
Label Encoding
Label Encoding replaces each category with an integer. scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order.
Example:
| Gender | Encoded |
|---|---|
| Male | 1 |
| Female | 0 |
Python Code:
from sklearn.preprocessing import LabelEncoder
# df is assumed to be an existing pandas DataFrame with a "Gender" column
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])  # Female -> 0, Male -> 1
When to Use?
For binary categories (Yes/No, Male/Female)
For decision-tree based models (Random Forest, XGBoost)
Do NOT use for:
Nominal data with many categories in linear or distance-based models
(City → 1, 2, 3 creates an artificial order)
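The artificial order is easy to see by printing the mapping that `LabelEncoder` learns. This is a minimal sketch on a made-up City column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample data for illustration
df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

le = LabelEncoder()
codes = le.fit_transform(df["City"])

# classes_ holds the categories in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)                      # {'Chennai': 0, 'Delhi': 1, 'Mumbai': 2}
print(le.inverse_transform(codes))  # back to the original city names
```

The codes say Chennai < Delhi < Mumbai, but that order only comes from the alphabet. A linear model would still treat it as meaningful.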
One-Hot Encoding
One-Hot Encoding creates separate columns for each category.
Example:
Color: Red, Blue, Green
| Color | Red | Blue | Green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |
Python Code (Pandas):
import pandas as pd
# dtype=int gives 0/1 columns instead of True/False
one_hot = pd.get_dummies(df["Color"], dtype=int)
Scikit-learn:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Color"]]).toarray()
When to Use?
For nominal categories (no order)
For linear models (Logistic Regression, Linear Regression)
For neural networks
Drawback:
Creates one column per category, so features with many categories add many columns (the “curse of dimensionality”)
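The column growth can be checked directly. The sketch below uses a made-up Color column and shows that the number of one-hot columns equals the number of distinct categories. The `drop_first=True` option of `get_dummies` is shown as one optional way to save a column; whether to use it depends on the model.

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

one_hot = pd.get_dummies(df["Color"], dtype=int)
print(one_hot)

# One column per distinct category, and exactly one 1 in every row
print(one_hot.shape[1], df["Color"].nunique())

# Optional: drop the first category; it is implied when all other columns are 0
reduced = pd.get_dummies(df["Color"], dtype=int, drop_first=True)
print(list(reduced.columns))
```

With 3 colors this is harmless, but a column with 300 cities would produce 300 new columns.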
Ordinal Encoding (For Ordered Categories)
Use when categories have a natural order.
Examples:
Education Level:
Low (1) < Medium (2) < High (3)
Python Code:
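from sklearn.preprocessing import OrdinalEncoder
# Pass the order explicitly; by default the categories are sorted alphabetically
education_order = [["High School", "Graduate", "Post-Graduate"]]
encoder = OrdinalEncoder(categories=education_order)
df["Education"] = encoder.fit_transform(df[["Education"]])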
When to Use?
For ordinal features
For tree-based models
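The Low (1) < Medium (2) < High (3) mapping above can also be written by hand with a plain dictionary. This is a minimal sketch; the column name `Level` and the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({"Level": ["Medium", "Low", "High", "Low"]})

# The order is stated explicitly, so nothing depends on alphabetical sorting.
order = {"Low": 1, "Medium": 2, "High": 3}
df["Level_encoded"] = df["Level"].map(order)

print(df)
```

A value that is missing from the dictionary becomes NaN, which is a simple way to spot unexpected categories.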
Target Encoding (Advanced Technique)
Replaces each category with the mean of the target variable for that category.
Example:
If “City” is encoded by its average purchase rate:
| City | Avg Purchase |
|---|---|
| Delhi | 0.72 |
| Mumbai | 0.65 |
| Chennai | 0.54 |
Useful for:
Large number of categories
High-cardinality features
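Risk:
Can cause data leakage, because the encoding is computed from the target
Always split into train and test first, compute the category means on the training set only, then apply them to the test set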
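Here is a minimal sketch of target encoding done by hand with pandas, assuming the data is already split into train and test. The column names, the sample values, and the choice to fill unseen cities with the overall training mean are all assumptions for illustration.

```python
import pandas as pd

# Made-up training and test data for illustration
train = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
    "Purchased": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"City": ["Delhi", "Chennai", "Kolkata"]})

# Learn the means from the training set only.
city_means = train.groupby("City")["Purchased"].mean()
global_mean = train["Purchased"].mean()

train["City_encoded"] = train["City"].map(city_means)

# Apply the same means to the test set. A city never seen in training
# (Kolkata here) falls back to the overall training mean.
test["City_encoded"] = test["City"].map(city_means).fillna(global_mean)

print(test)
```

The test set's target values are never used, which is the point of the rule above. A more careful version would also compute the training-set encodings out-of-fold, so each row is not encoded with its own target value.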
Binary Encoding (For Many Categories)
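Each category is first mapped to an integer, and that integer is then written in binary digits, one column per digit.
Example:
Category number 10 → binary 1010
This needs 4 columns, where One-Hot Encoding of 10 categories would need 10.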
Useful for:
Thousands of unique values
NLP or high-cardinality datasets
Library: category_encoders
from category_encoders.binary import BinaryEncoder
encoder = BinaryEncoder(cols=["City"])
df_encoded = encoder.fit_transform(df)
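To show what happens inside, the idea can be sketched by hand with pandas. This only illustrates the technique; it is not meant to reproduce the exact output of `category_encoders`. The numbering (sorted categories, starting at 1) and the column names are assumptions.

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Chennai", "Pune", "Delhi"]})

# Step 1: give each category an integer, starting at 1 (assumed convention).
codes = df["City"].astype("category").cat.codes + 1

# Step 2: how many binary digits are needed for the largest integer?
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first.
for bit in range(n_bits - 1, -1, -1):
    df[f"City_{n_bits - 1 - bit}"] = (codes // (2 ** bit)) % 2

print(df)
```

Four cities fit into 3 columns here. The saving grows quickly: a column with 300 categories needs only 9 binary columns instead of 300 one-hot columns.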
Real-World Examples
Example 1: Bakery Customer Dataset
Columns: Gender, City, Age
Gender → Label Encoding
City → One-Hot Encoding
Age → No encoding needed (numeric)
Example 2: House Price Prediction
Column: “Condition” = Poor, Average, Good, Excellent
→ Ordinal Encoding (has order)
Example 3: E-commerce Data
City has 300 categories
→ Use Binary or Target Encoding
Python Example: Encoding All Types
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
df = pd.DataFrame({
"Gender": ["Male", "Female", "Male"],
"City": ["Delhi", "Mumbai", "Delhi"],
"Education": ["Graduate", "Post-Graduate", "High School"]
})
# Label Encoding
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])
# One-Hot Encoding
df = pd.get_dummies(df, columns=["City"], dtype=int)  # 0/1 columns instead of True/False
# Ordinal Encoding
education_order = [["High School", "Graduate", "Post-Graduate"]]
ord_enc = OrdinalEncoder(categories=education_order)
df["Education"] = ord_enc.fit_transform(df[["Education"]])
print(df)