Handling Missing Values in Data Science: Complete Beginner Guide

Missing values are one of the biggest problems in real-world datasets.
When you collect data from websites, surveys, APIs, or databases, some values may be:

Missing
Blank
Incorrect
NaN (Not a Number)
Null

If you don’t handle missing data properly, your analysis can be biased, and many machine learning models will either fail on the missing entries or learn from distorted data.

What Are Missing Values?

A missing value means there is no data for that entry.

Example table:

Name	Age	Salary
Ravi	25	50,000
Tina		48,000
Ajay	30

Tina’s Age and Ajay’s Salary are missing.

In Pandas, missing values appear as:

NaN (in numeric columns)
None (in object columns, such as text)
NaT (in datetime columns)
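As a rough sketch, the example table above could be built in Pandas like this. The name `df` and the `Joined` date column are assumptions added for illustration; `Joined` is there only to show what NaT looks like.

```python
import numpy as np
import pandas as pd

# The example table, with the gaps written out explicitly.
# "Joined" is a hypothetical column, included only to demonstrate NaT.
df = pd.DataFrame({
    "Name": ["Ravi", "Tina", "Ajay"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 48000, None],
    "Joined": pd.to_datetime(["2021-01-05", None, "2022-03-10"]),
})

print(df)
print(df.dtypes)  # Age and Salary become float64 because NaN is a float
```

Note that the `None` in the numeric Salary column is stored as NaN, so both markers are treated the same way by the detection functions.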

Why Do Missing Values Occur?

Missing values happen due to:

People skipping survey questions
Errors during data entry
Sensors failing to record data
API failures
Data corruption
System crashes
Incomplete uploads

In Data Science, handling missing values is an essential step before analysis.

How to Detect Missing Values in Pandas

First import Pandas:

import pandas as pd

Check missing values:

df.isnull().sum()

This tells you how many missing values are in each column.
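Here is a small sketch of the detection step on a made-up DataFrame (the column names and values are assumptions, not real data). Besides the per-column count, it also shows the percentage of missing values and the rows that contain gaps, which can help when deciding what to do next.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "Age": [25, np.nan, 30, np.nan],
    "City": ["Delhi", "Mumbai", None, "Pune"],
})

missing_count = df.isnull().sum()             # missing values per column
missing_percent = df.isnull().mean() * 100    # percentage missing per column
rows_with_gaps = df[df.isnull().any(axis=1)]  # rows with at least one gap

print(missing_count)
print(missing_percent)
print(rows_with_gaps)
```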

Techniques to Handle Missing Values

There is no single best method.
You choose what is best based on the dataset and problem type.

Let’s learn the most popular techniques.

Remove Missing Values (Deletion Method)

Remove rows with missing values

df.dropna(inplace=True)

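Remove columns with too many missing values

# Keep only columns with at least 70% non-missing values (the 70% cutoff is just an example)
df = df.dropna(axis=1, thresh=int(0.7 * len(df)))

Without thresh, df.dropna(axis=1) drops every column that contains even one missing value.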

When to use?

When only a small share of the data is missing
When the dropped rows will not affect the analysis

When NOT to use?

When a large amount of data would be lost
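The deletion options above can be sketched as follows. The DataFrame, its column names, and the 50% cutoff are all assumptions chosen to make the effect easy to see.

```python
import numpy as np
import pandas as pd

# Made-up data: "Notes" is mostly empty
df = pd.DataFrame({
    "Age": [25, np.nan, 30, 41],
    "Salary": [50000, 48000, np.nan, 52000],
    "Notes": [np.nan, np.nan, np.nan, "ok"],
})

# Keep only rows with no missing values at all
complete_rows = df.dropna()

# Only require that Age is present; gaps in other columns are allowed
age_known = df.dropna(subset=["Age"])

# Keep columns where at least 50% of the values are present
min_non_missing = int(len(df) * 0.5)
dense_columns = df.dropna(axis=1, thresh=min_non_missing)

print(len(df), "->", len(complete_rows), "rows after dropna()")
print(dense_columns.columns.tolist())
```

Here plain `dropna()` keeps just one of the four rows, which is a good reminder to check how much data a deletion step actually removes.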

Fill Missing Values (Imputation Method)

Imputation means filling the missing value with a suitable replacement.

Common Imputation Techniques

Fill with a constant value

Useful for categories (e.g., “Unknown”)

df["City"] = df["City"].fillna("Unknown")

Fill with Mean (for numerical data)

df["Age"] = df["Age"].fillna(df["Age"].mean())

When to use?

When data does not contain extreme outliers

Fill with Median

Best when data is skewed or has outliers.

df["Income"] = df["Income"].fillna(df["Income"].median())

Fill with Mode (most common value)

df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

Useful for categorical columns.
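The simple fills above (constant, mean, median, mode) can be tried side by side in a short sketch. The data is made up, with one deliberately extreme income so that the difference between mean and median is visible.

```python
import numpy as np
import pandas as pd

# Made-up data with one extreme outlier in Income
df = pd.DataFrame({
    "City": ["Delhi", None, "Mumbai", "Delhi"],
    "Income": [40000, 45000, np.nan, 1000000],
    "Gender": ["F", "M", None, "F"],
})

mean_income = df["Income"].mean()      # pulled far upward by the outlier
median_income = df["Income"].median()  # stays close to the typical values
print("mean:", mean_income, "median:", median_income)

df["City"] = df["City"].fillna("Unknown")                   # constant
df["Income"] = df["Income"].fillna(median_income)           # median
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])  # mode

print(df)
```

With the outlier present, the mean is several times larger than the median, so filling with the mean would give the missing person an unrealistic income.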

Forward Fill (use previous value)

df = df.ffill()

Useful for time-series data, where the most recent reading is a reasonable stand-in for a gap.

Backward Fill (use next value)

df = df.bfill()
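To see the difference between forward fill and backward fill, here is a minimal sketch on a made-up series of daily sensor readings (the name `readings` and the values are assumptions).

```python
import numpy as np
import pandas as pd

# Made-up daily readings with gaps
readings = pd.Series(
    [20.0, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = readings.ffill()   # 20, 20, 20, 23, 23
backward = readings.bfill()  # 20, 23, 23, 23, NaN

print(pd.DataFrame({"original": readings, "ffill": forward, "bfill": backward}))
```

Backward fill leaves the last value missing because there is no later reading to copy from. In the same way, forward fill cannot fill a gap at the very start of a series.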

Advanced Techniques (For Higher Accuracy)

These are used in professional Data Science & Machine Learning projects.

KNN Imputation

Fills each missing value using the values of the most similar rows (its nearest neighbors).
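One possible way to try this is scikit-learn's `KNNImputer`. This is a sketch under assumptions: the guide does not name a library, scikit-learn must be installed, and the data and `n_neighbors=2` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: the last person has no Age recorded
df = pd.DataFrame({
    "Age": [25, 26, 40, 41, np.nan],
    "Salary": [50000, 52000, 90000, 92000, 51000],
})

# Each missing value becomes the average of that column
# over the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled)
```

The row with the missing Age has a salary close to the first two rows, so its Age is filled from those two neighbors. On real data, columns with very different scales are usually scaled first so that one column does not dominate the distance.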

Regression Imputation

Predicts the missing value with a model trained on the rows where that value is present.
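As a minimal sketch of the idea, the example below fits a straight line with NumPy and uses it to predict the missing salary. The columns `Experience` and `Salary`, the values, and the choice of a simple linear fit are all assumptions; a real project might use a richer model.

```python
import numpy as np
import pandas as pd

# Made-up data: Salary is missing for the person with 3 years of experience
df = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5],
    "Salary": [30000, 35000, np.nan, 45000, 50000],
})

# Fit Salary = intercept + slope * Experience on the rows where Salary is known
known = df["Salary"].notna()
slope, intercept = np.polyfit(
    df.loc[known, "Experience"], df.loc[known, "Salary"], deg=1
)

# Predict for every row, then use the predictions only where Salary is missing
predicted = intercept + slope * df["Experience"]
df["Salary"] = df["Salary"].fillna(predicted)

print(df)
```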

Multiple Imputation

Creates several imputed versions of the dataset, runs the analysis on each one, and combines the results.
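A rough sketch of this idea, assuming scikit-learn is installed: `IterativeImputer` with `sample_posterior=True` draws a slightly different fill on each run, so running it with several seeds gives several completed datasets. The data, the five runs, and the use of the average salary as the "analysis" are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the import below)
from sklearn.impute import IterativeImputer

# Made-up data: Salary in thousands, with two gaps
df = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8],
    "Salary": [30.0, 36.0, np.nan, 44.0, 51.0, np.nan, 59.0, 66.0],
})

estimates = []
for seed in range(5):
    # Each seed produces a different plausible completed dataset
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    # "Analysis" step: here, simply the average salary
    estimates.append(completed["Salary"].mean())

pooled_mean = float(np.mean(estimates))  # combined estimate
spread = float(np.std(estimates))        # variation caused by the imputation

print("pooled mean:", pooled_mean, "spread:", spread)
```

The spread between runs is the useful extra: it shows how much of the final answer depends on the filled-in values rather than on the observed data.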

Interpolation

Estimates each missing value from the known points on either side of it. Commonly used in time-series data (e.g., stock prices).

df = df.interpolate()
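Here is a small sketch of interpolation on a made-up daily price series (the name `prices` and the values are assumptions). By default `interpolate()` draws a straight line between the known points; `method="time"` does the same but takes the actual dates into account.

```python
import numpy as np
import pandas as pd

# Made-up daily prices with a two-day gap
prices = pd.Series(
    [100.0, np.nan, np.nan, 106.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

linear = prices.interpolate()                # 100, 102, 104, 106
by_time = prices.interpolate(method="time")  # same here, since the dates are evenly spaced

print(pd.DataFrame({"original": prices, "linear": linear, "time": by_time}))
```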

Real-World Example

Example dataset:

Customer	Age	Salary	City
A	25	50,000	Delhi
B	NaN	45,000	Mumbai
C	30	NaN	NaN

Steps:

Fill missing Age with median
Fill missing Salary with mean
Fill missing City with “Unknown”

After fixing missing values, the dataset becomes:

Customer	Age	Salary	City
A	25	50,000	Delhi
B	27.5 (median)	45,000	Mumbai
C	30	47,500 (mean)	Unknown

Now the dataset is clean and ready for analysis.
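The three steps above can be reproduced in a few lines. This is a sketch that types the example table in by hand; the name `df` is an assumption.

```python
import numpy as np
import pandas as pd

# The example dataset from the table above
df = pd.DataFrame({
    "Customer": ["A", "B", "C"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 45000, np.nan],
    "City": ["Delhi", "Mumbai", None],
})

df["Age"] = df["Age"].fillna(df["Age"].median())       # median of 25 and 30 is 27.5
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())  # mean of 50,000 and 45,000 is 47,500
df["City"] = df["City"].fillna("Unknown")

print(df)
```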

Python Example: Handling Missing Values

import pandas as pd

df = pd.read_csv("data.csv")

# Check missing values
print(df.isnull().sum())

# Fill numeric columns with median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Fill categorical columns with mode
df["City"] = df["City"].fillna(df["City"].mode()[0])

# Keep only rows that have at least 3 non-missing values
df = df.dropna(thresh=3)

print(df.head())
