Encoding Categorical Variables: Simple Guide with Examples for Beginners

Most machine learning models work only with numbers, not text.
So whenever your dataset contains categories like:

Gender: Male, Female
City: Delhi, Mumbai, Chennai
Color: Red, Blue, Green
Education: Graduate, Post-Graduate

…you must convert them into numeric values.

This process is called Encoding Categorical Variables.

What Are Categorical Variables?

Categorical variables contain text labels instead of numbers.

Types of Categorical Data

Nominal (no order)
Examples: Gender, City, Color

Ordinal (has a natural order)
Examples:
- Low < Medium < High
- Education: High School < Graduate < Post-Graduate
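
To make the nominal vs. ordinal difference concrete, here is a small sketch using pandas' Categorical type. The sample values are made up for illustration; the point is only that an ordinal column carries an explicit order while a nominal one does not.

```python
import pandas as pd

# Nominal: categories have no order, so "smaller" or "larger" has no meaning.
city = pd.Categorical(["Delhi", "Mumbai", "Chennai"], ordered=False)

# Ordinal: we state the order ourselves, from lowest to highest.
level = pd.Categorical(
    ["Medium", "Low", "High"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

print(city.ordered)              # False
print(level.ordered)             # True
print(list(level.codes))         # position of each value in the stated order
print(level.min(), level.max())  # only meaningful because the order is known
```

Because the order was stated, "Low" gets position 0, "Medium" 1, and "High" 2.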

Why Do We Need Encoding?

Most ML algorithms cannot process raw text.
They expect every feature to be a number.

Encoding helps:

Convert text to numbers
Make data usable for ML algorithms
Improve accuracy when the encoding matches the type of category
Avoid errors during model training
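
To see the kind of training error encoding avoids, the sketch below feeds a raw text column straight into a scikit-learn model. The tiny dataset and the choice of LogisticRegression are assumptions made only for this demonstration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# A text column that has NOT been encoded yet (illustrative data).
X = pd.DataFrame({"City": ["Delhi", "Mumbai", "Chennai", "Delhi"]})
y = [1, 0, 0, 1]

try:
    LogisticRegression().fit(X, y)
    error_message = None
except ValueError as exc:
    # scikit-learn cannot turn the city names into floats, so fitting fails.
    error_message = str(exc)

print(error_message)
```

Once "City" is encoded with any of the techniques below, the same fit call can run.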

Types of Encoding Techniques

We will cover:

Label Encoding
One-Hot Encoding
Ordinal Encoding
Target Encoding (advanced)
Binary Encoding (advanced)

Let’s explain each in very simple terms.

Label Encoding

Label Encoding replaces categories with numbers.

Example:

Gender	Encoded
Male	1
Female	0

Python Code:

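import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})  # sample data

le = LabelEncoder()
# Categories are numbered in sorted order: Female -> 0, Male -> 1
df["Gender"] = le.fit_transform(df["Gender"])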

When to Use?

For binary categories (Yes/No, Male/Female)
For decision-tree based models (Random Forest, XGBoost)

Do NOT use for:

Nominal data with more than two categories in linear or distance-based models
(City → 0, 1, 2 implies an order that does not exist)
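
If you want to check which number each category received, the fitted encoder keeps that information. The sketch below uses a made-up Gender column and reads the mapping back from the encoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

le = LabelEncoder()
codes = le.fit_transform(df["Gender"])

# classes_ lists the categories in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}

# inverse_transform turns the numbers back into the original labels.
restored = list(le.inverse_transform(codes))
print(restored)
```

Note that scikit-learn documents LabelEncoder as a tool for target labels (y). For feature columns, OrdinalEncoder does the same job and accepts several columns at once.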

One-Hot Encoding

One-Hot Encoding creates separate columns for each category.

Example:

Color: Red, Blue, Green

Color	Red	Blue	Green
Red	1	0	0
Blue	0	1	0
Green	0	0	1

Python Code (Pandas):

import pandas as pd
one_hot = pd.get_dummies(df["Color"])

Scikit-learn:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Color"]]).toarray()

When to Use?

For nominal categories (no order)
For linear models (Logistic Regression, Linear Regression)
For neural networks

Drawback:

Creates one column per category, so features with many categories produce many sparse columns (the “curse of dimensionality”)

Ordinal Encoding (For Ordered Categories)

Use when categories have a natural order.

Examples:
Education Level:

Low (1) < Medium (2) < High (3)

Python Code:

from sklearn.preprocessing import OrdinalEncoder

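# Pass the order explicitly; by default categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
# Codes start at 0: Low -> 0, Medium -> 1, High -> 2
df["Education"] = encoder.fit_transform(df[["Education"]]).ravel()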

When to Use?

For ordinal features
For tree-based models
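
A common pitfall is forgetting to tell the encoder what the order is. The sketch below, on a made-up "Level" column, compares the default behaviour (alphabetical sorting) with an explicitly supplied order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Level": ["Low", "Medium", "High", "Medium"]})

# Default: categories are sorted alphabetically (High, Low, Medium),
# so the numbers do not follow the real order.
default_codes = OrdinalEncoder().fit_transform(df[["Level"]]).ravel()

# Explicit order: Low < Medium < High.
ordered = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered_codes = ordered.fit_transform(df[["Level"]]).ravel()

print(default_codes.tolist())
print(ordered_codes.tolist())
```

With the explicit order, the codes start at 0 (Low = 0, Medium = 1, High = 2). The exact starting number does not matter; what matters is that the ranking is preserved.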

Target Encoding (Advanced Technique)

Replaces each category with the mean of the target variable for that category.

Example:
For “City”, each value is replaced by its average purchase rate:

City	Avg Purchase
Delhi	0.72
Mumbai	0.65
Chennai	0.54

Useful for:

Features with a large number of categories (high cardinality)

Risk:

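Can cause data leakage, because the encoding is built from the target itself
Always compute the category means on the training split only, then apply them to the test split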
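
Here is a minimal hand-written sketch of that idea using plain pandas. The column names, the sample rows, and the choice to fall back to the overall training mean for unseen cities are all assumptions made for illustration; dedicated encoders usually add smoothing on top of this.

```python
import pandas as pd

train = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai", "Chennai"],
    "Purchased": [1, 1, 1, 0, 0, 0],
})
test = pd.DataFrame({"City": ["Delhi", "Chennai", "Pune"]})  # "Pune" is unseen

# Learn the per-city mean of the target from the TRAINING rows only.
city_means = train.groupby("City")["Purchased"].mean()
global_mean = train["Purchased"].mean()

# Apply the learned means to the test rows; unseen cities get the global mean.
test["City_encoded"] = test["City"].map(city_means).fillna(global_mean)

print(city_means.to_dict())
print(test)
```

The test rows never contribute to the means, which is what keeps target information from leaking into evaluation.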

Binary Encoding (For Many Categories)

Each category is converted to binary digits.

Example:
Category number 10 → binary 1010
Ten categories need about 4 binary columns instead of 10 one-hot columns.

Useful for:

Thousands of unique values
NLP or high-cardinality datasets

Library: category_encoders

from category_encoders.binary import BinaryEncoder

encoder = BinaryEncoder(cols=["City"])
df_encoded = encoder.fit_transform(df)
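
To show what happens under the hood, here is a hand-rolled sketch of the same idea in plain pandas. It is not the category_encoders implementation (the library's exact numbering and column names may differ), and the city list is made up for illustration.

```python
import pandas as pd

cities = pd.Series(["Delhi", "Mumbai", "Chennai", "Pune", "Delhi", "Kolkata"])

# Step 1: give each distinct category an integer ID (order of first appearance).
ids, uniques = pd.factorize(cities)

# Step 2: work out how many binary digits the largest ID needs.
n_bits = max(1, int(ids.max()).bit_length())

# Step 3: one column per binary digit, most significant digit first.
binary_cols = pd.DataFrame(
    {f"City_{bit}": (ids >> (n_bits - 1 - bit)) & 1 for bit in range(n_bits)}
)

print(list(uniques))
print(binary_cols)
```

Five distinct cities fit into 3 columns here, where one-hot encoding would need 5.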

Real-World Examples

Example 1: Bakery Customer Dataset

Columns: Gender, City, Age

Gender → Label Encoding
City → One-Hot Encoding
Age → No encoding needed (numeric)

Example 2: House Price Prediction

Column: “Condition” = Poor, Average, Good, Excellent
→ Ordinal Encoding (has order)

Example 3: E-commerce Data

City has 300 categories
→ Use Binary or Target Encoding
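
The choices in these examples mostly come down to how many distinct values a column has. The helper below is a hypothetical rule-of-thumb sketch: the function name suggest_encoding and the cutoff of 15 are assumptions for illustration, not a standard. It also cannot detect ordinal columns, since order is something only you know about your data.

```python
import pandas as pd

def suggest_encoding(column: pd.Series, max_one_hot: int = 15) -> str:
    """Rule-of-thumb helper; max_one_hot is an illustrative cutoff."""
    if pd.api.types.is_numeric_dtype(column):
        return "none (already numeric)"
    n_unique = column.nunique()
    if n_unique <= 2:
        return "label"
    if n_unique <= max_one_hot:
        return "one-hot"
    return "binary or target"

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "City": ["Delhi", "Mumbai", "Chennai", "Delhi"],
    "Age": [25, 31, 47, 38],
})
suggestions = {name: suggest_encoding(df[name]) for name in df.columns}
print(suggestions)

# A column with 300 distinct values, like City in Example 3.
many_cities = pd.Series([f"city_{i}" for i in range(300)])
print(suggest_encoding(many_cities))
```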

Python Example: Encoding All Types

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "City": ["Delhi", "Mumbai", "Delhi"],
    "Education": ["Graduate", "Post-Graduate", "High School"]
})

# Label Encoding
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])

# One-Hot Encoding
df = pd.get_dummies(df, columns=["City"])

# Ordinal Encoding
education_order = [["High School", "Graduate", "Post-Graduate"]]
ord_enc = OrdinalEncoder(categories=education_order)
df["Education"] = ord_enc.fit_transform(df[["Education"]])

print(df)
