Data Manipulation & Cleaning with Pandas

In Data Science, 80% of your time is spent cleaning and manipulating data, not building machine learning models.
Pandas is the most important Python library for these tasks.

This guide explains all the essential methods for real-world data cleaning, with easy examples and clear explanations.

Why Data Cleaning Is Important

Real datasets usually contain:

  • Missing values

  • Incorrect data

  • Duplicates

  • Wrong formats

  • Outliers

  • Mixed data types

Without cleaning, your analysis and model results will be incorrect.

Importing Pandas

 

import pandas as pd

Loading Data

Pandas can read various file formats.

CSV

df = pd.read_csv("data.csv")

Excel

df = pd.read_excel("file.xlsx")

Display top rows

df.head()

Handling Missing Values

Missing values appear as NaN.

Check missing values

df.isnull().sum()

A. Removing Missing Rows

df.dropna(inplace=True)

B. Filling Missing Values

Fill with a constant

df.fillna(0)

Fill with mean

df["Age"].fillna(df["Age"].mean(), inplace=True)

Fill with median

df["Salary"].fillna(df["Salary"].median(), inplace=True)

Forward fill

df.fillna(method="ffill")

Removing Duplicates

Duplicate rows cause biased results.

Find duplicates

df.duplicated().sum()

Remove duplicates

df.drop_duplicates(inplace=True)

Handling Incorrect Data Types

Check data types

df.dtypes

Convert column type

df["Date"] = pd.to_datetime(df["Date"])
df["Age"] = df["Age"].astype(int)

Renaming Columns

Proper names make analysis easy.

df.rename(columns={"oldName": "NewName"}, inplace=True)

Filtering Data

Filter using conditions

df[df["Age"] > 25]

Filter multiple conditions

df[(df["Age"] > 25) & (df["Gender"] == "Female")]

Sorting Data

Sort values

df.sort_values("Salary", ascending=False)

Creating New Columns

Create calculated column

df["Bonus"] = df["Salary"] * 0.10

Create column based on conditions

df["Category"] = df["Age"].apply(lambda x: "Senior" if x > 50 else "Junior")

Applying Functions

Using apply()

df["Score2"] = df["Score"].apply(lambda x: x * 2)

Merging & Joining DataFrames

Used to combine multiple datasets.

Merge on a common column

merged = pd.merge(df1, df2, on="ID")

Left join

joined = df1.merge(df2, how="left", on="ID")

Grouping & Aggregation

Useful for summarizing data.

Group by column

df.groupby("Department")["Salary"].mean()

Multiple aggregations

df.groupby("Department").agg({"Salary": ["mean", "max"], "Age": "median"})

Data Cleaning for Strings

Text cleaning is important for real datasets.

Lowercase conversion

df["Name"] = df["Name"].str.lower()

Remove whitespace

df["City"] = df["City"].str.strip()

Replace values

df["Gender"].replace({"M": "Male", "F": "Female"}, inplace=True)

Detecting Outliers

Using IQR

Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df["Salary"] < (Q1 - 1.5*IQR)) | (df["Salary"] > (Q3 + 1.5*IQR))]

Why Pandas Is the Best Tool for Data Cleaning

Because Pandas offers:

  • Fast processing

  • Easy filtering

  • Simple syntax

  • Excel-like operations

  • Seamless integration with NumPy, Matplotlib & Seaborn

  • Essential functions for ML-friendly datasets