Summary Statistics & Data Distribution: Simple Guide for Data Science

In Exploratory Data Analysis (EDA), the first goal is to understand your dataset.
Two essential steps are:

  1. Summary Statistics

  2. Understanding Data Distribution

These steps help you understand how your data behaves, whether it contains outliers, and what transformations or models you should apply next.

What Are Summary Statistics?

Summary statistics are a handful of numbers that describe the key characteristics of your data.
They help you answer questions like:

  • What is the average value?

  • How spread out is the data?

  • What are the highest and lowest values?

  • Are there outliers?

Think of summary statistics as a quick health check for your dataset.

Important Summary Statistics You Must Know

Let’s learn each in very simple terms.

Mean (Average)

Add all numbers → divide by count.

Example:
[10, 20, 30] → Mean = 20

Median (Middle Value)

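Sort the numbers → pick the middle one. With an even count, average the two middle values.

Useful when data contains outliers (e.g., salaries), because a few extreme values barely move the median.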

Mode (Most Frequent Value)

The value that appears most often.
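
To make the three measures concrete, here is a minimal sketch using Python's built-in statistics module. The salary figures are made up for illustration; the last value is a deliberate outlier so you can see how it pulls the mean but not the median.

```python
import statistics

# Hypothetical salaries (in thousands); 250 is a deliberate outlier.
salaries = [30, 32, 35, 35, 40, 250]

mean_salary = statistics.mean(salaries)      # sum of values / count
median_salary = statistics.median(salaries)  # middle of the sorted values
mode_salary = statistics.mode(salaries)      # most frequent value

print("mean:", round(mean_salary, 2))
print("median:", median_salary)
print("mode:", mode_salary)
```

The mean lands far above most of the salaries, while the median and mode stay at 35. This is why the median is usually the safer "typical value" when outliers are present.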

Minimum & Maximum

Lowest and highest values in the dataset.

Range

Difference → max − min
Shows how spread out data is.

Variance

The average of the squared distances between each value and the mean.

Standard Deviation (Std Dev)

Square root of the variance, so it is expressed in the same units as the data.
Shows how far values typically sit from the mean.

  • Small Std Dev → values close together

  • Large Std Dev → values spread out
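
The spread measures above can be computed in a few lines. This is a small sketch with the statistics module, reusing the [10, 20, 30] list from the mean example. It shows both the population version (divide by n) and the sample version (divide by n − 1), since libraries differ on which one they use by default.

```python
import statistics

values = [10, 20, 30]

value_range = max(values) - min(values)        # max - min
pop_variance = statistics.pvariance(values)    # average squared distance from the mean (divide by n)
sample_variance = statistics.variance(values)  # same idea, but divide by n - 1
pop_std = statistics.pstdev(values)            # square root of the population variance
sample_std = statistics.stdev(values)          # square root of the sample variance

print(value_range, pop_variance, sample_variance, pop_std, sample_std)
```

Keep the two versions in mind when comparing tools: pandas' std() uses the sample formula by default, while NumPy's np.std() uses the population formula unless you pass ddof=1.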

Percentiles / Quartiles

A percentile is the value below which a given percentage of the data falls. For example, 90% of values lie below the 90th percentile.

Common ones:

  • Q1 (25th percentile)

  • Q2 (50th percentile = median)

  • Q3 (75th percentile)

Used for box plots and outlier detection.
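
As a rough illustration of how quartiles support outlier detection, the sketch below uses NumPy and the common 1.5 × IQR rule of thumb. The rule and the sample values are assumptions added here for illustration; other thresholds are possible.

```python
import numpy as np

# Hypothetical values; 95 is deliberately far from the rest.
data = np.array([10, 12, 13, 15, 16, 18, 20, 95])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

# Conventional fences at 1.5 * IQR beyond the quartiles (an assumed rule of thumb).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("Q1, Q2, Q3:", q1, q2, q3)
print("outliers:", outliers)
```

Box plots draw their whiskers using the same idea, which is why points beyond the whiskers are usually read as outlier candidates.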

Summary Statistics in Python (Pandas)

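Use the DataFrame's describe() method:

import pandas as pd

# Load the dataset ("data.csv" is a placeholder path)
df = pd.read_csv("data.csv")

# describe() summarizes the numeric columns by default
print(df.describe())

For numeric columns, the output includes: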

  • count

  • mean

  • std

  • min

  • 25%

  • 50%

  • 75%

  • max
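
If you do not have a CSV file at hand, the following sketch builds a tiny DataFrame in memory and shows how to pull individual numbers out of the describe() result. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for a real CSV file.
df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "Income": [28000, 32000, 45000, 52000, 61000],
    "City": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

summary = df.describe()  # numeric columns only by default, so "City" is skipped

print(summary)
print(summary.loc["mean", "Age"])    # one statistic for one column
print(summary.loc["50%", "Income"])  # the 50% row is the median
```

To include text columns as well, df.describe(include="all") adds rows such as the number of unique values and the most frequent value.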

What Is Data Distribution?

Data distribution describes how the values in your dataset are spread out.

It tells you:

  • Is the data centered around a specific value?

  • Does it have outliers?

  • Is it symmetric or skewed?

  • Does it follow a known pattern (like normal distribution)?

Understanding distribution helps you choose:

  • The right statistical tests

  • The right machine learning algorithms

  • The correct preprocessing steps

Types of Data Distributions

Normal Distribution (Bell Curve)

Most values cluster around the mean, and frequencies fall off evenly on both sides.
Examples: adult heights, many test scores (approximately).

Characteristics:

  • Symmetrical

  • Mean = Median = Mode

Skewed Distribution

Right-Skewed (Positive Skew)

Tail on the right.
Examples: salaries, property prices.

Typically, Mean > Median (the long right tail pulls the mean up).

Left-Skewed (Negative Skew)

Tail on the left.
Examples: exam scores where most students score high.

Typically, Mean < Median (the long left tail pulls the mean down).
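
A quick way to check skew direction in practice is to compare the mean with the median and look at the skewness coefficient. The sketch below uses synthetic data (an exponential sample and its mirror image), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic samples, for illustration only.
right_skewed = pd.Series(rng.exponential(scale=1.0, size=5000))  # long right tail
left_skewed = -right_skewed                                      # mirror image: long left tail

for name, s in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name,
          "| mean:", round(s.mean(), 3),
          "| median:", round(s.median(), 3),
          "| skew:", round(s.skew(), 3))
```

A positive skew() value goes with a right tail and a negative value with a left tail; values near zero suggest a roughly symmetric distribution.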

Uniform Distribution

All values occur with equal probability.
Examples: rolling a fair die, or a random number generator drawing evenly from a fixed range.

Bimodal Distribution

Two peaks.
Example: heights of men and women combined in one sample.

Multimodal Distribution

More than two peaks.

How to Visualize Data Distribution (EDA Essentials)

Histogram

Shows frequency of values.

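import matplotlib.pyplot as plt

# df is the DataFrame loaded earlier; "Age" is assumed to be a numeric column
df["Age"].hist(bins=20)
plt.xlabel("Age")
plt.show()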

Kernel Density Plot (KDE)

Smooth curve showing distribution.

import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(df["Income"])  # smoothed estimate of the distribution
plt.show()

Box Plot

Shows median, quartiles & outliers.

sns.boxplot(x=df["Salary"])  # pass the column as x so every seaborn version reads it the same way
plt.show()

Violin Plot

Shows shape + density + box plot in one graph.
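
A violin plot can be drawn with Matplotlib's built-in violinplot method, as in the sketch below (seaborn's sns.violinplot is an alternative). The salary values are synthetic and only serve to give the plot a visible shape.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed "salary" values for illustration only.
rng = np.random.default_rng(42)
salaries = rng.lognormal(mean=10.5, sigma=0.4, size=500)

fig, ax = plt.subplots()
parts = ax.violinplot(salaries, showmedians=True)  # density outline plus a median marker
ax.set_ylabel("Salary")
ax.set_xticks([])  # a single violin, so the x axis carries no information

# plt.show()  # uncomment when running as a script; notebooks render the figure automatically
```

The width of the violin at any height shows how many values sit near that salary, so a long thin top indicates a right tail.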

Why Summary Statistics & Distribution Are Important

They help you:

  • Detect outliers

  • Understand skewness

  • Choose scaling methods

  • Decide on transformations (log, Box-Cox)

  • Validate assumptions for algorithms (e.g., inference in linear regression assumes normally distributed residuals, not normally distributed features)

  • Prepare data for modeling
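
As one example of acting on these checks, the sketch below measures skewness before and after a log transform. The income data is synthetic (drawn from a log-normal distribution), so the transform works especially well here; real data will usually improve less cleanly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

# Synthetic, strictly positive "income" values with a long right tail.
income = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=5000))

# The log needs values > 0; np.log1p is a common choice when zeros are possible.
log_income = np.log(income)

skew_before = income.skew()
skew_after = log_income.skew()
print("skew before:", round(skew_before, 2))
print("skew after: ", round(skew_after, 2))
```

If a plain log is not enough, a Box-Cox transform (available as scipy.stats.boxcox in SciPy) searches for a power transform that brings the data closer to normal.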

Real-World Example

Dataset: Student Marks
[45, 55, 60, 62, 85, 92]

Summary Statistics

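  • Mean = 66.5

  • Median = 61 (average of the two middle values, 60 and 62)

  • Std Dev ≈ 16.6 (population) or ≈ 18.2 (sample), which is fairly high, so marks vary a lot

  • Range = 92 − 45 = 47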
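
These figures can be reproduced with a few lines of Python. This is a minimal check using the built-in statistics module on the marks listed above.

```python
import statistics

marks = [45, 55, 60, 62, 85, 92]

mean_mark = statistics.mean(marks)      # 66.5
median_mark = statistics.median(marks)  # 61.0 (average of 60 and 62)
pop_std = statistics.pstdev(marks)      # population standard deviation (divide by n)
sample_std = statistics.stdev(marks)    # sample standard deviation (divide by n - 1)
mark_range = max(marks) - min(marks)    # 47

print(mean_mark, median_mark, round(pop_std, 2), round(sample_std, 2), mark_range)
```

The mean sitting above the median is the first hint that the higher marks are pulling the average upward.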

Distribution

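Slightly right-skewed (mean 66.5 > median 61) → a couple of high scores pull the mean up. With only six values, treat this as a rough indication rather than a firm conclusion.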

Insights

  • Scaling or a transformation may help if the chosen model is sensitive to the range or shape of the data

  • Check how the model handles skewness before fitting it