Data Manipulation & Cleaning with Pandas
In Data Science, 80% of your time is spent cleaning and manipulating data, not building machine learning models.
Pandas is the most important Python library for these tasks.
This guide explains all the essential methods for real-world data cleaning, with easy examples and clear explanations.
Why Data Cleaning Is Important
Real datasets usually contain:
Missing values
Incorrect data
Duplicates
Wrong formats
Outliers
Mixed data types
Without cleaning, your analysis and model results will be incorrect.
Importing Pandas
import pandas as pdLoading Data
Pandas can read various file formats.
CSV
df = pd.read_csv("data.csv")
Excel
df = pd.read_excel("file.xlsx")
Display top rows
df.head()
Handling Missing Values
Missing values appear as NaN.
Check missing values
df.isnull().sum()
A. Removing Missing Rows
df.dropna(inplace=True)
B. Filling Missing Values
Fill with a constant
df.fillna(0)
Fill with mean
df["Age"].fillna(df["Age"].mean(), inplace=True)
Fill with median
df["Salary"].fillna(df["Salary"].median(), inplace=True)
Forward fill
df.fillna(method="ffill")
Removing Duplicates
Duplicate rows cause biased results.
Find duplicates
df.duplicated().sum()
Remove duplicates
df.drop_duplicates(inplace=True)
Handling Incorrect Data Types
Check data types
df.dtypes
Convert column type
df["Date"] = pd.to_datetime(df["Date"])
df["Age"] = df["Age"].astype(int)
Renaming Columns
Proper names make analysis easy.
df.rename(columns={"oldName": "NewName"}, inplace=True)Filtering Data
Filter using conditions
df[df["Age"] > 25]
Filter multiple conditions
df[(df["Age"] > 25) & (df["Gender"] == "Female")]
Sorting Data
Sort values
df.sort_values("Salary", ascending=False)
Creating New Columns
Create calculated column
df["Bonus"] = df["Salary"] * 0.10
Create column based on conditions
df["Category"] = df["Age"].apply(lambda x: "Senior" if x > 50 else "Junior")
Applying Functions
Using apply()
df["Score2"] = df["Score"].apply(lambda x: x * 2)
Merging & Joining DataFrames
Used to combine multiple datasets.
Merge on a common column
merged = pd.merge(df1, df2, on="ID")
Left join
joined = df1.merge(df2, how="left", on="ID")
Grouping & Aggregation
Useful for summarizing data.
Group by column
df.groupby("Department")["Salary"].mean()
Multiple aggregations
df.groupby("Department").agg({"Salary": ["mean", "max"], "Age": "median"})
Data Cleaning for Strings
Text cleaning is important for real datasets.
Lowercase conversion
df["Name"] = df["Name"].str.lower()
Remove whitespace
df["City"] = df["City"].str.strip()
Replace values
df["Gender"].replace({"M": "Male", "F": "Female"}, inplace=True)
Detecting Outliers
Using IQR
Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["Salary"] < (Q1 - 1.5*IQR)) | (df["Salary"] > (Q3 + 1.5*IQR))]
Why Pandas Is the Best Tool for Data Cleaning
Because Pandas offers:
Fast processing
Easy filtering
Simple syntax
Excel-like operations
Seamless integration with NumPy, Matplotlib & Seaborn
Essential functions for ML-friendly datasets