Handling Missing Values in Data Science: Complete Beginner Guide

Missing values are one of the biggest problems in real-world datasets.
When you collect data from websites, surveys, APIs, or databases, some values may be:

  • Missing

  • Blank

  • Incorrect

  • NaN (Not a Number)

  • Null

If you don’t handle missing data properly, your analysis and machine learning models can give biased or misleading results, and some algorithms will fail to run at all.

What Are Missing Values?

A missing value means there is no data for that entry.

Example table:

Name    Age    Salary
Ravi    25     50,000
Tina           48,000
Ajay    30

Tina’s Age and Ajay’s Salary are missing.

In Pandas, missing values appear as:

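  • NaN (Not a Number), the default marker in numeric columns

  • None, Python's null object, typically found in object (text) columns

  • NaT (Not a Time), the marker in datetime columns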
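To see these markers side by side, here is a minimal sketch. The column names and values are made up for illustration and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical sample data: the None in the numeric column is converted to NaN,
# and the missing datetime is stored as NaT.
df = pd.DataFrame({
    "Age": [25, None, 30],
    "City": ["Delhi", None, "Mumbai"],
    "Joined": pd.to_datetime(["2024-01-05", None, "2024-03-10"]),
})

print(df)
print(df.dtypes)

# isna() treats NaN, None, and NaT alike: all of them count as missing
missing_mask = df.isna()
print(missing_mask)
```

Whatever marker Pandas uses internally, isna() (and its alias isnull()) reports it as missing.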

Why Do Missing Values Occur?

Missing values happen due to:

  • People skipping survey questions

  • Errors during data entry

  • Sensors failing to record data

  • API failures

  • Data corruption

  • System crashes

  • Incomplete uploads

In Data Science, handling missing values is a mandatory step before analysis.

How to Detect Missing Values in Pandas

First import Pandas:

import pandas as pd

Check missing values:

df.isnull().sum()

This tells you how many missing values are in each column.
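As a rough sketch, the check can be tried on the small Name/Age/Salary table from earlier. The extra variables (missing_percent, rows_with_gaps) are illustrative additions, not part of the original snippet.

```python
import numpy as np
import pandas as pd

# The small example table from this guide, with the gaps written as NaN
df = pd.DataFrame({
    "Name": ["Ravi", "Tina", "Ajay"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 48000, np.nan],
})

missing_counts = df.isnull().sum()             # missing values per column
missing_percent = df.isnull().mean() * 100     # share of missing values per column
rows_with_gaps = df[df.isnull().any(axis=1)]   # rows with at least one gap

print(missing_counts)
print(missing_percent.round(1))
print(rows_with_gaps)
```

Looking at the percentage as well as the count helps when deciding later whether to drop or fill a column.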

Techniques to Handle Missing Values

There is no single best method.
You choose what is best based on the dataset and problem type.

Let’s learn the most popular techniques.

Remove Missing Values (Deletion Method)

Remove rows with missing values

df.dropna(inplace=True)

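Remove columns with too many missing values

df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

Here thresh is the minimum number of non-missing values a column needs in order to be kept (50% of the rows in this example; adjust the cutoff to your data). Without thresh, df.dropna(axis=1) drops every column that contains even one missing value.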

When to use?

  • When only a small fraction of the data is missing

  • When dropped rows will not affect analysis

When NOT to use?

  • When a large amount of data would be lost
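The sketch below shows both kinds of deletion on a made-up table and reports how many rows survive. The data and the 50% cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "Notes" column is almost entirely empty
df = pd.DataFrame({
    "Age": [25, np.nan, 30, 41],
    "Salary": [50000, 48000, np.nan, 52000],
    "Notes": [np.nan, np.nan, np.nan, "VIP"],
})

# Keep only columns with at least 50% non-missing values (assumed cutoff)
min_non_missing = int(0.5 * len(df))
cols_kept = df.dropna(axis=1, thresh=min_non_missing)

# Then drop any row that still has a gap
rows_kept = cols_kept.dropna()

print(cols_kept.columns.tolist())
print(f"Rows before: {len(df)}, rows after: {len(rows_kept)}")
```

Comparing the row count before and after is a quick way to judge whether deletion costs too much data.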

Fill Missing Values (Imputation Method)

Imputation means filling the missing value with a suitable replacement.

Common Imputation Techniques

Fill with a constant value

Useful for categories (e.g., “Unknown”).

df["City"] = df["City"].fillna("Unknown")

Fill with Mean (for numerical data)

df["Age"] = df["Age"].fillna(df["Age"].mean())

When to use?

  • When data does not contain extreme outliers

Fill with Median

Best when data is skewed or has outliers, because the median is not pulled toward extreme values the way the mean is.

df["Income"] = df["Income"].fillna(df["Income"].median())
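To illustrate why the median is safer with outliers, here is a small sketch with invented income figures, one of which is extreme.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme outlier and one missing value
income = pd.Series([30000, 32000, 35000, 38000, 1000000, np.nan])

mean_value = income.mean()      # pulled upward by the outlier
median_value = income.median()  # stays near the typical income

filled_with_mean = income.fillna(mean_value)
filled_with_median = income.fillna(median_value)

print(f"Mean:   {mean_value:,.0f}")
print(f"Median: {median_value:,.0f}")
```

In this made-up case the mean is several times larger than any typical income, so filling with it would insert an unrealistic value.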

Fill with Mode (most common value)

df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

Useful for categorical columns.
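A short sketch of mode imputation on an invented categorical column follows. It also shows why the [0] is needed: mode() returns a Series, since ties can produce more than one value.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two gaps
gender = pd.Series(["F", "M", "F", np.nan, "M", "F", np.nan])

modes = gender.mode()      # ignores NaN; returns a Series of the most common value(s)
most_common = modes[0]     # take the first one
filled = gender.fillna(most_common)

print(most_common)
print(filled.value_counts())
```

Keep in mind that filling many gaps with the mode inflates the share of the most common category.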

Forward Fill (use previous value)

df = df.ffill()

Useful for time-series data, where the last recorded value is a reasonable stand-in until the next reading arrives.

Backward Fill (use next value)

df = df.bfill()
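Forward and backward fill are easiest to compare side by side. The sketch below uses made-up daily temperature readings.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with gaps
dates = pd.date_range("2024-01-01", periods=6, freq="D")
temps = pd.Series([21.0, np.nan, np.nan, 24.0, np.nan, 26.0], index=dates)

forward = temps.ffill()    # carry the last known reading forward
backward = temps.bfill()   # pull the next known reading backward

print(pd.DataFrame({"raw": temps, "ffill": forward, "bfill": backward}))
```

Note that forward fill cannot fill a gap at the very start of a series, and backward fill cannot fill one at the very end, because there is no value to copy from.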

Advanced Techniques (For Higher Accuracy)

These are used in professional Data Science & Machine Learning projects.

KNN Imputation

Fills each missing value using the values from the most similar rows (the nearest neighbors).
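One possible way to try this is scikit-learn's KNNImputer. This guide does not name a library, so treat the choice of scikit-learn, the sample data, and n_neighbors=2 as assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data; KNNImputer works on numeric columns only
df = pd.DataFrame({
    "Age": [25, 27, np.nan, 45, 47],
    "Salary": [30000, 32000, 31000, 80000, np.nan],
})

# n_neighbors=2 is an arbitrary choice for this tiny dataset
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

Because the method relies on distances between rows, columns on very different scales (such as Age and Salary) are usually standardized first so that one column does not dominate.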

Regression Imputation

Predicts the missing value with an ML model trained on the rows where that value is present.
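As a minimal sketch, a straight-line fit with NumPy can stand in for the ML model. The Experience and Salary columns are invented, and a real project would likely use a richer model with more predictors.

```python
import numpy as np
import pandas as pd

# Hypothetical data where Salary rises roughly linearly with Experience
df = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5, 6],
    "Salary": [30000, 35000, np.nan, 45000, np.nan, 55000],
})

known = df[df["Salary"].notna()]   # rows used to fit the model
missing = df["Salary"].isna()      # rows that need a prediction

# Fit Salary = slope * Experience + intercept on the complete rows
slope, intercept = np.polyfit(known["Experience"], known["Salary"], deg=1)

# Predict only the missing entries and write them back
df.loc[missing, "Salary"] = slope * df.loc[missing, "Experience"] + intercept

print(df)
```

The model is fitted only on rows where Salary is known, and the observed values are left untouched.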

Multiple Imputation

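Creates several imputed versions of the dataset, runs the analysis on each one, and combines the results, so that the uncertainty of the imputed values is reflected in the final answer.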

Interpolation

Estimates missing values from the known points around them. Commonly used in time-series data (e.g., stock prices).

df = df.interpolate()
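Here is a small sketch of the default (linear) interpolation on invented closing prices.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices with two missing days
dates = pd.date_range("2024-01-01", periods=5, freq="D")
prices = pd.Series([100.0, np.nan, np.nan, 106.0, 108.0], index=dates)

# The default method="linear" spaces the missing values evenly
# between the surrounding known points
filled = prices.interpolate()

print(filled)
```

Unlike forward fill, which would repeat 100.0 for both missing days, interpolation follows the trend between the known values. For unevenly spaced timestamps, interpolate(method="time") may be a better fit.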

Real-World Example

Example dataset:

Customer    Age    Salary    City
A           25     50,000    Delhi
B           NaN    45,000    Mumbai
C           30     NaN       NaN

Steps:

  1. Fill missing Age with median
  2. Fill missing Salary with mean
  3. Fill missing City with “Unknown”

After fixing missing values, the dataset becomes:

Customer    Age              Salary           City
A           25               50,000           Delhi
B           27.5 (median)    45,000           Mumbai
C           30               47,500 (mean)    Unknown

Now the dataset is clean and ready for analysis.
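The three steps can be reproduced in a few lines. This sketch rebuilds the example table by hand instead of loading it from a file.

```python
import numpy as np
import pandas as pd

# The example dataset from this section, with the gaps written as NaN
df = pd.DataFrame({
    "Customer": ["A", "B", "C"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 45000, np.nan],
    "City": ["Delhi", "Mumbai", np.nan],
})

df["Age"] = df["Age"].fillna(df["Age"].median())          # step 1: median of 25 and 30
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())   # step 2: mean of 50,000 and 45,000
df["City"] = df["City"].fillna("Unknown")                 # step 3: constant label

print(df)
```

Both the median and the mean are computed from the non-missing values only, because Pandas skips NaN in these calculations by default.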

Python Example: Handling Missing Values

import pandas as pd

df = pd.read_csv("data.csv")

# Check missing values
print(df.isnull().sum())

# Fill the numeric Age column with its median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Fill the categorical City column with its mode
df["City"] = df["City"].fillna(df["City"].mode()[0])

# Keep only rows that have at least 3 non-missing values
df = df.dropna(thresh=3)

print(df.head())