Model Evaluation Metrics: Accuracy, Precision, Recall & F1 Score Explained

After building a Machine Learning model, the most important question is:

“How good is my model?”

Model Evaluation Metrics help us measure model performance and decide whether the model is reliable or not.

This article explains the four most important classification metrics:

  • Accuracy

  • Precision

  • Recall

  • F1 Score

Why Do We Need Evaluation Metrics?

A model can look good but still be wrong or misleading.

Evaluation metrics help to:

  • Measure correctness

  • Compare models

  • Avoid wrong business decisions

  • Improve model performance

First, Understand the Confusion Matrix

All these metrics are based on the Confusion Matrix.

Example: Disease Prediction (Yes/No)

Actual \ Predicted    Yes    No
Yes                   TP     FN
No                    FP     TN

What do these mean?

  • TP (True Positive) → Correctly predicted Yes

  • TN (True Negative) → Correctly predicted No

  • FP (False Positive) → Predicted Yes but actually No

  • FN (False Negative) → Predicted No but actually Yes
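
To get these four counts for your own predictions, scikit-learn's confusion_matrix function is a handy starting point. Here is a minimal sketch with made-up labels (1 = Yes, 0 = No):

from sklearn.metrics import confusion_matrix

# Made-up example labels: 1 = Yes (disease), 0 = No (healthy)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() flattens the 2x2 matrix into TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)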

Accuracy

Accuracy tells us how many predictions were correct overall.

Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Simple Meaning:

Out of all predictions, how many were correct?

Example:

Total predictions = 100
Correct predictions = 90

Accuracy = 90 / 100 = 90%

When Is Accuracy Good?

  • When data is balanced

  • When False Positives and False Negatives have equal importance

When Is Accuracy NOT Enough?

  • When data is imbalanced (e.g., fraud detection); see the sketch below
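
Here is a minimal sketch of that pitfall, using made-up counts for an imbalanced fraud dataset: a model that always predicts "not fraud" still scores very high on accuracy.

# Made-up counts: 990 legitimate and 10 fraudulent transactions,
# with a model that always predicts "not fraud"
tp, tn, fp, fn = 0, 990, 0, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
print("Accuracy:", accuracy)  # 0.99, yet not a single fraud case is caught

A 99% accuracy here hides the fact that the model detects zero fraud, which is exactly why we also need Precision, Recall, and F1.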

Precision

Precision tells us how many predicted positives were actually correct.

Formula:

Precision = TP / (TP + FP)

Simple Meaning:

When the model says YES, how often is it correct?

Example (Spam Email):

  • Emails marked as spam = 20

  • Actually spam = 15

Precision = 15 / 20 = 75%

When Is Precision Important?

  • Spam detection

  • Fraud alerts

  • Legal cases

In these tasks, you want as few false alarms as possible.
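
Plugging the spam example above into code, here is a minimal sketch of the precision formula (counts taken from that example):

# Spam example from above: 20 emails flagged as spam, 15 of them actually spam
tp = 15          # flagged as spam and really spam
fp = 20 - 15     # flagged as spam but actually legitimate

precision = tp / (tp + fp)
print("Precision:", precision)  # 0.75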

Recall (Sensitivity)

Recall tells us how many actual positives were correctly detected.

Formula:

Recall = TP / (TP + FN)

Simple Meaning:

Out of all real YES cases, how many did the model catch?

Example (Disease Detection):

  • Actual sick patients = 50

  • Correctly detected = 45

Recall = 45 / 50 = 90%

When Is Recall Important?

  • Disease detection

  • Fraud detection

  • Safety systems

In these tasks, you don't want to miss any real cases.
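
The disease example above translates to code the same way; here is a minimal sketch of the recall formula (counts taken from that example):

# Disease example from above: 50 patients are actually sick, the model detects 45 of them
tp = 45          # sick patients correctly detected
fn = 50 - 45     # sick patients the model missed

recall = tp / (tp + fn)
print("Recall:", recall)  # 0.9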

F1 Score

F1 Score balances Precision and Recall; it is their harmonic mean.

Formula:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Simple Meaning:

One number that considers both false positives and false negatives.

When to Use F1 Score?

  • When data is imbalanced

  • When both Precision & Recall are important
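
Using the example values from the earlier sections (precision = 0.75 from the spam example, recall = 0.90 from the disease example) purely for illustration, here is a minimal sketch of the F1 formula:

precision = 0.75   # from the spam example above
recall = 0.90      # from the disease example above

f1 = 2 * (precision * recall) / (precision + recall)
print("F1 Score:", round(f1, 3))  # about 0.818

Note how F1 sits between precision and recall but is pulled toward the lower of the two, which is why it penalizes models that are strong on one metric and weak on the other.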

Accuracy vs Precision vs Recall vs F1

Metric       What It Focuses On              Best Used When
Accuracy     Overall correctness             Balanced data
Precision    Avoid false positives           Spam, fraud alerts
Recall       Avoid false negatives           Disease detection
F1 Score     Balance of precision & recall   Imbalanced data

Real-Life Example

Disease Test

  • Accuracy → Overall test correctness

  • Precision → If test says sick, is patient really sick?

  • Recall → Did we catch all sick patients?

  • F1 Score → Balance between false alarms and missed cases

In healthcare → Recall is usually the most important.
In spam detection → Precision is usually the most important.

Python Example (Using Scikit-learn)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Actual labels vs. model predictions (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))