Summary Statistics & Data Distribution: Simple Guide for Data Science
In Exploratory Data Analysis (EDA), the first goal is to understand your dataset.
Two essential steps are:
Summary Statistics
Understanding Data Distribution
These steps help you understand how your data behaves, whether it contains outliers, and what transformations or models you should apply next.
What Are Summary Statistics?
Summary Statistics are quick numbers that describe the key characteristics of your data.
They help you answer questions like:
What is the average value?
How spread out is the data?
What are the highest and lowest values?
Are there outliers?
Think of summary statistics as a quick health check for your dataset.
Important Summary Statistics You Must Know
Let’s learn each in very simple terms.
Mean (Average)
Add all numbers → divide by count.
Example:
[10, 20, 30] → Mean = 20
Median (Middle Value)
Sort numbers → pick the middle one. With an even count, average the two middle values.
More reliable than the mean when data contains outliers (e.g., salaries).
Mode (Most Frequent Value)
The value that appears most often.
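To make the three measures of center concrete, here is a minimal sketch using Python's built-in statistics module. The list values is made-up data with one outlier (1000) so you can see how the mean moves while the median and mode stay put.

```python
import statistics

values = [10, 20, 30, 30, 1000]  # hypothetical data; 1000 is an outlier

mean_value = statistics.mean(values)      # (10 + 20 + 30 + 30 + 1000) / 5
median_value = statistics.median(values)  # middle value after sorting
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)  # 218 30 30

# With an even count, the median is the average of the two middle values
even_median = statistics.median([10, 20, 30, 40])
print(even_median)  # 25.0
```

The outlier pulls the mean up to 218, while the median stays at 30. This is why the median is often preferred for data like salaries.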
Minimum & Maximum
Lowest and highest values in the dataset.
Range
Difference → max − min
Shows how spread out data is.
Variance
Average of the squared distances between each value and the mean.
Standard Deviation (Std Dev)
Square root of variance.
Shows how much values vary overall.
Small Std Dev → values close together
Large Std Dev → values spread out
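A small sketch of variance and standard deviation, again with the built-in statistics module. It assumes the same [10, 20, 30] list used in the mean example. Note that there are two versions: the population version divides by n, while the sample version divides by n − 1.

```python
import statistics

values = [10, 20, 30]  # mean is 20, so the squared distances are 100, 0, 100

pop_var = statistics.pvariance(values)    # 200 / 3, about 66.67 (divide by n)
sample_var = statistics.variance(values)  # 200 / 2 = 100 (divide by n - 1)

pop_std = statistics.pstdev(values)       # square root of pop_var, about 8.16
sample_std = statistics.stdev(values)     # square root of sample_var = 10

print(pop_var, sample_var)
print(pop_std, sample_std)
```

Pandas reports the sample version (n − 1) by default in std(), so small differences between tools usually come from this choice.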
Percentiles / Quartiles
Percentiles show what percentage of values fall below a number.
Common ones:
Q1 (25th percentile)
Q2 (50th percentile = median)
Q3 (75th percentile)
Used for box plots and outlier detection.
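The sketch below shows how quartiles can be used for outlier detection. It assumes the common 1.5 × IQR rule (the same rule most box plots use for their whiskers), and the list values is made-up data. Other cutoffs are possible, so treat the 1.5 multiplier as a convention rather than a fixed law.

```python
import statistics

values = [12, 15, 14, 10, 18, 16, 13, 95]  # hypothetical data; 95 looks suspicious

# n=4 splits the data into quartiles; "inclusive" interpolates between data points
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: spread of the middle 50% of values

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]

print(q1, q2, q3)  # 12.75 14.5 16.5
print(outliers)    # [95]
```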
Summary Statistics in Python (Pandas)
Use the DataFrame's describe() method:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.describe())
For numeric columns, the output includes:
count
mean
std
min
25%
50%
75%
max
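If you do not have a CSV file at hand, you can try describe() on a small in-memory table. The column names and values below are invented for illustration only.

```python
import pandas as pd

# Hypothetical data standing in for data.csv
df = pd.DataFrame({
    "Age": [22, 25, 31, 40],
    "City": ["A", "B", "A", "C"],
})

# By default, describe() summarizes numeric columns only
summary = df.describe()
print(summary)

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
full_summary = df.describe(include="all")
print(full_summary)
```

Notice that the City column is missing from the first summary. Text columns are skipped unless you ask for them.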
What Is Data Distribution?
Data distribution describes how the values in your dataset are spread out.
It tells you:
Is the data centered around a specific value?
Does it have outliers?
Is it symmetric or skewed?
Does it follow a known pattern (like normal distribution)?
Understanding distribution helps you choose:
The right statistical tests
The right machine learning algorithms
The correct preprocessing steps
Types of Data Distributions
Normal Distribution (Bell Curve)
Most values are around the mean.
Examples: height, test scores.
Characteristics:
Symmetrical
Mean = Median = Mode
Skewed Distribution
Right-Skewed (Positive Skew)
Tail on the right.
Examples: salaries, property prices.
Mean > Median
Left-Skewed (Negative Skew)
Tail on the left.
Examples: exam scores where most students score high.
Mean < Median
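You can check the mean-versus-median rule with a few lines of code. The sketch below uses a hypothetical helper called skewness that computes the simple moment-based skewness coefficient (libraries such as pandas apply a small-sample correction, so their numbers may differ slightly). Both datasets are made up.

```python
import statistics

def skewness(values):
    """Moment-based skewness: positive = right tail, negative = left tail."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # average squared distance
    m3 = sum((x - mean) ** 3 for x in values) / n  # average cubed distance
    return m3 / m2 ** 1.5

right_skewed = [30, 32, 35, 38, 40, 45, 200]  # hypothetical salaries (thousands)
left_skewed = [20, 85, 88, 90, 92, 95, 98]    # hypothetical exam scores

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name,
          "mean:", round(statistics.fmean(data), 1),
          "median:", statistics.median(data),
          "skewness:", round(skewness(data), 2))
```

For the salary data the mean (60) sits above the median (38) and the skewness is positive. For the exam data the pattern reverses.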
Uniform Distribution
All values occur with equal probability.
Example: Random number generator.
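As a quick illustration, the sketch below simulates rolling a fair six-sided die many times with Python's random module. The seed and the number of rolls are arbitrary choices made only so the run is repeatable.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed (arbitrary) so results are repeatable
rolls = [rng.randint(1, 6) for _ in range(60_000)]  # randint includes both ends

counts = Counter(rolls)
for face in sorted(counts):
    print(face, counts[face])
```

Each face should appear roughly 10,000 times. A histogram of this data would look flat rather than bell-shaped.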
Bimodal Distribution
Two peaks.
Example: combined heights of men and women.
Multimodal Distribution
More than two peaks.
How to Visualize Data Distribution (EDA Essentials)
Histogram
Shows frequency of values.
import matplotlib.pyplot as plt
df["Age"].hist()
plt.show()
Kernel Density Plot (KDE)
Smooth curve showing distribution.
import seaborn as sns
sns.kdeplot(df["Income"])
plt.show()
Box Plot
Shows median, quartiles & outliers.
sns.boxplot(x=df["Salary"])
plt.show()
Violin Plot
Shows shape + density + box plot in one graph.
Why Are Summary Statistics & Distribution Important?
They help you:
Detect outliers
Understand skewness
Choose scaling methods
Decide on transformations (log, Box-Cox)
Validate assumptions for algorithms (e.g., linear regression assumes normally distributed residuals, not normally distributed features)
Prepare data for modeling
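Here is a minimal sketch of how a log transformation can reduce right skew. The incomes list is invented (each value doubles the previous one), and skewness is a hypothetical helper using the simple moment-based formula. Box-Cox is not shown here.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: positive = right tail, negative = left tail."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

incomes = [1, 2, 4, 8, 16, 32, 64, 128, 256]  # hypothetical, strongly right-skewed
log_incomes = [math.log(x) for x in incomes]  # log needs values greater than 0

print("skewness before:", round(skewness(incomes), 2))
print("skewness after: ", round(skewness(log_incomes), 2))
```

After the transformation the values are evenly spaced, so the skewness drops to about zero. Real data will rarely come out this clean, but the direction of the effect is the same.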
Real-World Example
Dataset: Student Marks
[45, 55, 60, 62, 85, 92]
Summary Statistics
Mean = 66.5
Median = 61
Std Dev ≈ 18.2 (sample) → marks vary widely
Range = 92 − 45 = 47
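These numbers can be reproduced with the built-in statistics module. The variable names below are illustrative, and the standard deviation shown is the sample version (dividing by n − 1).

```python
import statistics

marks = [45, 55, 60, 62, 85, 92]

mean_mark = statistics.mean(marks)      # 399 / 6 = 66.5
median_mark = statistics.median(marks)  # (60 + 62) / 2 = 61.0
std_mark = statistics.stdev(marks)      # sample standard deviation, about 18.16
mark_range = max(marks) - min(marks)    # 92 - 45 = 47

print(mean_mark, median_mark, round(std_mark, 2), mark_range)
```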
Distribution
Mildly right-skewed (mean 66.5 > median 61) → a few students scored very high.
Insights
May benefit from scaling or a transformation
Check whether the chosen model is sensitive to skewness