Encoding Categorical Variables: Simple Guide with Examples for Beginners

Most machine learning models work only with numbers, not text.
So whenever your dataset contains categories like:

  • Gender: Male, Female

  • City: Delhi, Mumbai, Chennai

  • Color: Red, Blue, Green

  • Education: Graduate, Post-Graduate

…you must convert them into numeric values.

This process is called Encoding Categorical Variables.

What Are Categorical Variables?

Categorical variables take their values from a fixed set of labels (usually text) instead of measured numbers.

Types of Categorical Data

  1. Nominal (No order)
    Examples: Gender, City, Color

  2. Ordinal (Has order)
    Examples:

    • Low < Medium < High

    • Education: High School < Graduate < Postgraduate

Why Do We Need Encoding?

Most ML algorithms do their calculations on numerical arrays, so they cannot process text labels directly.

Encoding helps:

  • Convert text to numbers

  • Make data usable for ML algorithms

  • Improve accuracy when the encoding matches the type of category (nominal or ordinal)

  • Avoid errors during model training

Types of Encoding Techniques

We will cover:

  1. Label Encoding

  2. One-Hot Encoding

  3. Ordinal Encoding

  4. Target Encoding (advanced)

  5. Binary Encoding (advanced)

Let’s explain each in very simple terms.

Label Encoding

Label Encoding replaces categories with numbers.

Example:

Gender    Encoded
Male      1
Female    0

Python Code:

from sklearn.preprocessing import LabelEncoder

# Codes follow sorted label order: Female -> 0, Male -> 1
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])

When to Use?

  • For binary categories (Yes/No, Male/Female)

  • For decision-tree based models (Random Forest, XGBoost)

Do NOT use for:

  • Nominal data with many categories
    (City → 1, 2, 3 creates artificial order)
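
To see what Label Encoding does under the hood, here is a small library-free sketch. The helper name label_encode is made up for this illustration; it assigns codes in sorted label order, which is also how scikit-learn's LabelEncoder orders its classes.

```python
def label_encode(values):
    """Give each distinct label an integer code, using sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Binary category: the codes are just two flags
gender_codes, gender_map = label_encode(["Male", "Female", "Male"])
print(gender_map)    # {'Female': 0, 'Male': 1}
print(gender_codes)  # [1, 0, 1]

# Nominal category: the codes suggest Chennai < Delhi < Mumbai,
# an order that does not exist in the real data
city_codes, city_map = label_encode(["Delhi", "Mumbai", "Chennai"])
print(city_map)      # {'Chennai': 0, 'Delhi': 1, 'Mumbai': 2}
```

The second call shows the artificial order problem: a linear model would treat Mumbai (2) as "twice" Delhi (1), which has no meaning.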

One-Hot Encoding

One-Hot Encoding creates separate columns for each category.

Example:

Color: Red, Blue, Green

Color    Red    Blue    Green
Red      1      0       0
Blue     0      1       0

Python Code (Pandas):

import pandas as pd
one_hot = pd.get_dummies(df["Color"])

Scikit-learn:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Color"]]).toarray()

When to Use?

  • For nominal categories (no order)

  • For linear models (Logistic Regression, Linear Regression)

  • For neural networks

Drawback:

  • Creates many columns → “Curse of dimensionality”
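
The column growth is easy to see in a small library-free sketch. The helper one_hot_encode and the generated city names are made up for this illustration; real projects would normally use pd.get_dummies or OneHotEncoder as shown above.

```python
def one_hot_encode(values):
    """Return the column names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

columns, rows = one_hot_encode(["Red", "Blue", "Green", "Red"])
print(columns)  # ['Blue', 'Green', 'Red']
print(rows)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

# One column per distinct category: 300 cities means 300 new columns
many_cities = [f"City{i}" for i in range(300)]
city_columns, _ = one_hot_encode(many_cities)
print(len(city_columns))  # 300
```

Each row contains exactly one 1, and the number of columns always equals the number of distinct categories.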

Ordinal Encoding (For Ordered Categories)

Use when categories have a natural order.

Examples:
Education Level:

Low (1) < Medium (2) < High (3)

Python Code:

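from sklearn.preprocessing import OrdinalEncoder

# Pass the order explicitly; without it, categories are sorted alphabetically.
# Codes start at 0: High School -> 0, Graduate -> 1, Post-Graduate -> 2
encoder = OrdinalEncoder(categories=[["High School", "Graduate", "Post-Graduate"]])
df["Education"] = encoder.fit_transform(df[["Education"]])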

When to Use?

  • For ordinal features

  • For tree-based models
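
The order you give the encoder matters. The sketch below, using a made-up helper called ordinal_encode, compares an explicit order with plain alphabetical order for the Low/Medium/High example. Codes start at 0 here, as they do in scikit-learn.

```python
def ordinal_encode(values, order):
    """Map each label to its position in the given order (0-based)."""
    rank = {label: position for position, label in enumerate(order)}
    return [rank[v] for v in values]

levels = ["Medium", "Low", "High", "Low"]

# Explicit order: Low < Medium < High
correct = ordinal_encode(levels, ["Low", "Medium", "High"])
print(correct)       # [1, 0, 2, 0]

# Alphabetical order: High < Low < Medium, which is not the real ranking
alphabetical = ordinal_encode(levels, sorted(set(levels)))
print(alphabetical)  # [2, 1, 0, 1]
```

With alphabetical order, "High" gets the smallest code, so the numbers no longer reflect the real ranking.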

Target Encoding (Advanced Technique)

Target Encoding replaces each category with the mean of the target variable for that category.

Example:
If the feature is “City” and the target is whether a customer purchased, each city is replaced by its average purchase rate:

City       Avg Purchase
Delhi      0.72
Mumbai     0.65
Chennai    0.54

Useful for:

  • High-cardinality features (a large number of categories)

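Risk:

  • Can cause data leakage, because the encoding is computed from the target itself

  • Always split into train and test first, then compute the category means on the training set only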
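
A minimal sketch of leakage-safe target encoding is shown below. The function names and the toy purchase data are made up for this illustration, and falling back to the overall training mean for unseen cities is one common choice, assumed here rather than required.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn the mean target per category from TRAINING data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def apply_target_encoding(categories, means, default):
    """Replace each category with its learned mean (default if unseen)."""
    return [means.get(c, default) for c in categories]

# Toy training data: 1 = purchased, 0 = did not purchase
train_city = ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai", "Chennai"]
train_bought = [1, 1, 1, 0, 0, 0]
means, global_mean = fit_target_encoding(train_city, train_bought)

# Test rows reuse the training means; "Pune" was never seen in training
test_city = ["Mumbai", "Delhi", "Pune"]
encoded_test = apply_target_encoding(test_city, means, global_mean)
print(encoded_test)  # [0.5, 1.0, 0.5]
```

The test set's own target values are never used, which is what keeps the encoding from leaking information.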

Binary Encoding (For Many Categories)

Each category is converted to binary digits.

Example:
Category number 10 → binary 1010 → stored as four 0/1 columns
A feature with 10 categories needs only 4 columns instead of the 10 that One-Hot Encoding would create.

Useful for:

  • Thousands of unique values

  • NLP or high-cardinality datasets

Library: category_encoders

from category_encoders.binary import BinaryEncoder

encoder = BinaryEncoder(cols=["City"])
df_encoded = encoder.fit_transform(df)
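
To show where the "4 columns instead of 10" comes from, here is a library-free sketch. The helper binary_encode and the generated city names are made up for this illustration, and numbering categories from 1 is an assumption; the exact columns that category_encoders produces may differ.

```python
def binary_encode(values):
    """Number the categories from 1, then write each number as binary digits."""
    categories = sorted(set(values))
    width = len(categories).bit_length()  # digits needed for the largest number
    table = {
        category: [int(bit) for bit in format(number, f"0{width}b")]
        for number, category in enumerate(categories, start=1)
    }
    return table, width

cities = [f"City{i:02d}" for i in range(1, 11)]  # 10 distinct categories
table, width = binary_encode(cities)
print(width)            # 4
print(table["City01"])  # [0, 0, 0, 1]
print(table["City10"])  # [1, 0, 1, 0]
```

Each list becomes one row of 0/1 columns, so the column count grows with the number of binary digits rather than with the number of categories.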

Real-World Examples

Example 1: Bakery Customer Dataset

Columns: Gender, City, Age

  • Gender → Label Encoding

  • City → One-Hot Encoding

  • Age → No encoding needed (numeric)

Example 2: House Price Prediction

Column: “Condition” = Poor, Average, Good, Excellent
→ Ordinal Encoding (has order)

Example 3: E-commerce Data

City has 300 categories
→ Use Binary or Target Encoding
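
The choices in these three examples can be summarized as a rough rule of thumb. The function suggest_encoding and its threshold of 50 categories are hypothetical, chosen only for this sketch; the right cutoff depends on your data and model.

```python
def suggest_encoding(n_unique, ordered=False, high_cardinality_threshold=50):
    """Suggest an encoding from the number of categories and whether they are ordered.

    The threshold is an illustrative assumption, not a fixed rule.
    """
    if ordered:
        return "ordinal"
    if n_unique <= 2:
        return "label"
    if n_unique > high_cardinality_threshold:
        return "binary or target"
    return "one-hot"

print(suggest_encoding(2))                # Gender     -> label
print(suggest_encoding(3))                # City (few) -> one-hot
print(suggest_encoding(4, ordered=True))  # Condition  -> ordinal
print(suggest_encoding(300))              # City (300) -> binary or target
```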

Python Example: Encoding All Types

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "City": ["Delhi", "Mumbai", "Delhi"],
    "Education": ["Graduate", "Post-Graduate", "High School"]
})

# Label Encoding
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])

# One-Hot Encoding
df = pd.get_dummies(df, columns=["City"])

# Ordinal Encoding
education_order = [["High School", "Graduate", "Post-Graduate"]]
ord_enc = OrdinalEncoder(categories=education_order)
df["Education"] = ord_enc.fit_transform(df[["Education"]])

print(df)