Handling Missing Values in Data Science: Complete Beginner Guide
Missing values are one of the biggest problems in real-world datasets.
When you collect data from websites, surveys, APIs, or databases, some values may be:
Missing
Blank
Incorrect
NaN (Not a Number)
Null
If you don’t handle missing data properly, your analysis and machine learning models can give misleading results.
What Are Missing Values?
A missing value means there is no data for that entry.
Example table:
| Name | Age | Salary |
|---|---|---|
| Ravi | 25 | 50,000 |
| Tina | | 48,000 |
| Ajay | 30 | |
Tina’s Age and Ajay’s Salary are missing.
In Pandas, missing values appear as:
NaN
None
NaT
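To see these markers in practice, here is a minimal sketch. The values are made up for illustration; the point is only that NaN usually shows up in numeric columns, None in text columns, and NaT in date columns, and that isna() flags all of them.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen only to show the three markers.
ages = pd.Series([25, np.nan, 30])                             # NaN in a numeric column
cities = pd.Series(["Delhi", None, "Mumbai"], dtype="object")  # None in a text column
dates = pd.to_datetime(pd.Series(["2024-01-01", None]))        # NaT in a datetime column

# isna() treats all three markers as missing.
print(ages.isna().tolist())    # [False, True, False]
print(cities.isna().tolist())  # [False, True, False]
print(dates.isna().tolist())   # [False, True]
```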
Why Do Missing Values Occur?
Missing values happen due to:
People skipping survey questions
Errors during data entry
Sensors failing to record data
API failures
Data corruption
System crashes
Incomplete uploads
In Data Science, handling missing values is a mandatory step before analysis.
How to Detect Missing Values in Pandas
First import Pandas:
import pandas as pd
Check missing values:
df.isnull().sum()
This tells you how many missing values are in each column.
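Here is a small sketch of the detection step on a made-up DataFrame that mirrors the Name/Age/Salary example from earlier. Besides the count per column, it also shows the percentage of missing values and the rows that contain a gap, which are often handy when deciding what to do next.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the Name/Age/Salary example.
df = pd.DataFrame({
    "Name": ["Ravi", "Tina", "Ajay"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 48000, np.nan],
})

missing_counts = df.isnull().sum()            # missing values per column
missing_share = df.isnull().mean() * 100      # percentage missing per column
rows_with_gaps = df[df.isnull().any(axis=1)]  # rows with at least one missing value

print(missing_counts)
print(missing_share.round(1))
print(rows_with_gaps)
```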
Techniques to Handle Missing Values
There is no single best method.
You choose what is best based on the dataset and problem type.
Let’s learn the most popular techniques.
Remove Missing Values (Deletion Method)
Remove rows with missing values
df.dropna(inplace=True)
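Remove columns with too many missing values
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
The thresh argument sets the minimum number of non-missing values a column needs to be kept (here, half the rows). Without it, df.dropna(axis=1) drops every column that contains even one missing value.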
When to use?
When only a small fraction of the data is missing
When dropped rows will not affect analysis
When NOT to use?
When a large amount of data would be lost
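The deletion method can be sketched like this. The data and the 50% cutoff are assumptions made for the example, not a rule; the idea is to drop a mostly empty column first and only then drop the few rows that still have gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Notes" is mostly empty, "Age" has one gap.
df = pd.DataFrame({
    "Age": [25, np.nan, 30, 41],
    "Salary": [50000, 48000, 52000, 61000],
    "Notes": [np.nan, np.nan, np.nan, "VIP"],
})

# Keep only columns with at least 50% non-missing values (cutoff is an assumption).
min_non_missing = int(0.5 * len(df))
df_cols = df.dropna(axis=1, thresh=min_non_missing)

# Then drop the remaining rows that still contain a missing value.
df_clean = df_cols.dropna()

print(df_cols.columns.tolist())
print(len(df), "->", len(df_clean), "rows")
```

Checking how many rows survive, as in the last line, is a quick way to tell whether deletion is costing too much data.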
Fill Missing Values (Imputation Method)
Imputation means filling the missing value with a suitable replacement.
Common Imputation Techniques
Fill with a constant value
Useful for categories (e.g., “Unknown”)
df["City"] = df["City"].fillna("Unknown")
Fill with Mean (for numerical data)
df["Age"] = df["Age"].fillna(df["Age"].mean())
When to use?
When data does not contain extreme outliers
Fill with Median
Best when data is skewed or has outliers.
df["Income"] = df["Income"].fillna(df["Income"].median())
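To see why the median is safer when outliers are present, here is a small sketch with made-up incomes, one of which is extreme. The mean gets pulled far away from the typical values, while the median stays close to them.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme outlier and one missing value.
income = pd.Series([30000, 32000, 35000, 1000000, np.nan])

mean_value = income.mean()      # pulled upward by the outlier
median_value = income.median()  # middle of the observed values

filled_mean = income.fillna(mean_value)
filled_median = income.fillna(median_value)

print(mean_value)    # 274250.0
print(median_value)  # 33500.0
```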
Fill with Mode (most common value)
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
Useful for categorical columns.
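One detail worth knowing: mode() returns a Series, because a column can have more than one most-common value, and it returns an empty Series if every value is missing. A minimal sketch with made-up data that guards against the empty case:

```python
import pandas as pd

# Hypothetical categorical column with one gap.
gender = pd.Series(["F", "M", "F", None, "M", "F"], dtype="object")

modes = gender.mode()  # most frequent value(s); missing values are ignored
if not modes.empty:    # an all-missing column would give an empty result
    gender_filled = gender.fillna(modes.iloc[0])
else:
    gender_filled = gender

print(modes.tolist())          # ['F']
print(gender_filled.tolist())  # ['F', 'M', 'F', 'F', 'M', 'F']
```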
Forward Fill (use previous value)
df = df.ffill()
Useful for time-series data.
Backward Fill (use next value)
df = df.bfill()
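Forward and backward fill are easiest to understand side by side. The sketch below uses made-up daily temperature readings; note that backward fill cannot fill the last value, because there is no later observation to copy from.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with gaps.
temps = pd.Series(
    [21.0, np.nan, np.nan, 24.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = temps.ffill()   # carry the last observed value forward
backward = temps.bfill()  # pull the next observed value backward

print(forward.tolist())   # [21.0, 21.0, 21.0, 24.0, 24.0]
print(backward.tolist())  # [21.0, 24.0, 24.0, 24.0, nan]
```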
Advanced Techniques (For Higher Accuracy)
These are used in professional Data Science & Machine Learning projects.
KNN Imputation
Fills each missing value using the values from the most similar rows (the nearest neighbors).
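One common way to try this is scikit-learn's KNNImputer. The sketch below assumes scikit-learn is installed and uses made-up numeric data; n_neighbors=2 is only an illustrative choice. In real projects, columns on very different scales are usually scaled first so that one column does not dominate the distance.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data; KNNImputer works on numeric columns only.
df = pd.DataFrame({
    "Age": [25, 27, np.nan, 40, 42],
    "Salary": [50000, 52000, 51000, 90000, 95000],
})

# n_neighbors=2 is an illustrative choice, not a recommendation.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# The missing Age is the average Age of the two rows with the closest Salary.
print(imputed)
```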
Regression Imputation
Predicts the missing value with a machine learning model trained on the other columns.
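A minimal sketch of the idea, assuming scikit-learn is installed. The Experience and Salary columns are hypothetical, and a plain linear regression stands in for whatever model suits the data: fit on the rows where the value is known, then predict the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data where Salary grows linearly with Experience.
df = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5],
    "Salary": [30000, 35000, np.nan, 45000, 50000],
})

known = df[df["Salary"].notna()]
unknown = df[df["Salary"].isna()]

# Fit on rows where the target is observed, then predict the gaps.
model = LinearRegression().fit(known[["Experience"]], known["Salary"])
df.loc[unknown.index, "Salary"] = model.predict(unknown[["Experience"]])

print(df)
```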
Multiple Imputation
Creates several imputed versions of the dataset, analyzes each one, and pools the results.
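A rough sketch of the idea, assuming scikit-learn is installed and using its experimental IterativeImputer. The data, the five repetitions, and the choice of "average Salary" as the quantity being estimated are all assumptions for illustration. Each run draws different plausible values for the gaps, and the estimates are then averaged. A full multiple-imputation analysis also combines the variances, which is left out here.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the next import)
from sklearn.impute import IterativeImputer

# Hypothetical data with one gap in each column.
df = pd.DataFrame({
    "Age": [25, 32, np.nan, 41, 38, 29],
    "Salary": [40000, 52000, 58000, np.nan, 61000, 45000],
})

estimates = []
for seed in range(5):  # five imputed datasets; the number is an assumption
    # sample_posterior=True draws imputed values at random, so each run differs.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    estimates.append(completed["Salary"].mean())  # analysis step on each dataset

pooled_mean = float(np.mean(estimates))         # pooled point estimate
between_var = float(np.var(estimates, ddof=1))  # spread caused by the imputation

print(pooled_mean, between_var)
```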
Interpolation
Used in time-series (e.g., stock prices).
df.interpolate()
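Here is a small sketch of what interpolation does, using made-up daily prices. With the default linear method, the gaps are filled with evenly spaced values between the known points on either side.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices with a two-day gap.
prices = pd.Series(
    [100.0, np.nan, np.nan, 106.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

linear = prices.interpolate()  # default method is "linear"

print(linear.tolist())  # [100.0, 102.0, 104.0, 106.0]
```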
Real-World Example
Example dataset:
| Customer | Age | Salary | City |
|---|---|---|---|
| A | 25 | 50,000 | Delhi |
| B | NaN | 45,000 | Mumbai |
| C | 30 | NaN | NaN |
Steps:
- Fill missing Age with median
- Fill missing Salary with mean
- Fill missing City with “Unknown”
After fixing missing values, the dataset becomes:
| Customer | Age | Salary | City |
|---|---|---|---|
| A | 25 | 50,000 | Delhi |
| B | 27.5 (median) | 45,000 | Mumbai |
| C | 30 | 47,500 (mean) | Unknown |
Now the dataset is clean and ready for analysis.
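The same steps can be sketched in Pandas. The DataFrame below simply rebuilds the example table by hand, so you can run it and check the numbers yourself.

```python
import numpy as np
import pandas as pd

# The example customer table, rebuilt by hand.
df = pd.DataFrame({
    "Customer": ["A", "B", "C"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 45000, np.nan],
    "City": ["Delhi", "Mumbai", None],
})

df["Age"] = df["Age"].fillna(df["Age"].median())         # median of 25 and 30 is 27.5
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())  # mean of 50,000 and 45,000 is 47,500
df["City"] = df["City"].fillna("Unknown")                # constant for the missing category

print(df)
```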
Python Example: Handling Missing Values
import pandas as pd
df = pd.read_csv("data.csv")
# Check missing values
print(df.isnull().sum())
# Fill numeric columns with median
df["Age"] = df["Age"].fillna(df["Age"].median())
# Fill categorical columns with mode
df["City"] = df["City"].fillna(df["City"].mode()[0])
# Keep only rows that have at least 3 non-missing values
df = df.dropna(thresh=3)
print(df.head())