Model Evaluation Metrics: Accuracy, Precision, Recall & F1 Score Explained

After building a Machine Learning model, the most important question is:

“How good is my model?”

Model Evaluation Metrics help us measure model performance and decide whether the model is reliable or not.

This article explains the four most important classification metrics:

  • Accuracy

  • Precision

  • Recall

  • F1 Score

Why Do We Need Evaluation Metrics?

A model can look good but still be wrong or misleading.

Evaluation metrics help to:

  • Measure correctness

  • Compare models

  • Avoid wrong business decisions

  • Improve model performance

First, Understand the Confusion Matrix

All these metrics are based on the Confusion Matrix.

Example: Disease Prediction (Yes/No)

Actual \ Predicted    Yes    No
Yes                   TP     FN
No                    FP     TN

What do these mean?

  • TP (True Positive) → Correctly predicted Yes

  • TN (True Negative) → Correctly predicted No

  • FP (False Positive) → Predicted Yes but actually No

  • FN (False Negative) → Predicted No but actually Yes
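
To get these four counts for your own predictions, scikit-learn's confusion_matrix function is a handy starting point. Here is a minimal sketch with made-up labels (1 = Yes, 0 = No):

from sklearn.metrics import confusion_matrix

# Made-up example labels: 1 = Yes (disease), 0 = No (healthy)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() flattens the 2x2 matrix into TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)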

Accuracy

Accuracy tells us how many predictions were correct overall.

Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Simple Meaning:

Out of all predictions, how many were correct?

Example:

Total predictions = 100
Correct predictions = 90

Accuracy = 90 / 100 = 90%

When Is Accuracy Good?

  • When data is balanced

  • When False Positives and False Negatives have equal importance

When Is Accuracy NOT Enough?

  • When data is imbalanced (e.g., fraud detection); see the sketch below
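
Here is a minimal sketch of that pitfall, using made-up counts for an imbalanced fraud dataset: a model that always predicts "not fraud" still scores very high on accuracy.

# Made-up counts: 990 legitimate and 10 fraudulent transactions,
# with a model that always predicts "not fraud"
tp, tn, fp, fn = 0, 990, 0, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
print("Accuracy:", accuracy)  # 0.99, yet not a single fraud case is caught

A 99% accuracy here hides the fact that the model detects zero fraud, which is exactly why we also need Precision, Recall, and F1.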

Precision

Precision tells us how many predicted positives were actually correct.

Formula:

Precision = TP / (TP + FP)

Simple Meaning:

When the model says YES, how often is it correct?

Example (Spam Email):

  • Emails marked as spam = 20

  • Actually spam = 15

Precision = 15 / 20 = 75%

When Is Precision Important?

  • Spam detection

  • Fraud alerts

  • Legal cases

In these tasks, you want as few false alarms as possible.
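
Plugging the spam example above into code, here is a minimal sketch of the precision formula (counts taken from that example):

# Spam example from above: 20 emails flagged as spam, 15 of them actually spam
tp = 15          # flagged as spam and really spam
fp = 20 - 15     # flagged as spam but actually legitimate

precision = tp / (tp + fp)
print("Precision:", precision)  # 0.75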

Recall (Sensitivity)

Recall tells us how many actual positives were correctly detected.

Formula:

Recall = TP / (TP + FN)

Simple Meaning:

Out of all real YES cases, how many did the model catch?

Example (Disease Detection):

  • Actual sick patients = 50

  • Correctly detected = 45

Recall = 45 / 50 = 90%

When Is Recall Important?

  • Disease detection

  • Fraud detection

  • Safety systems

In these tasks, you don't want to miss any real cases.
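
The disease example above translates to code the same way; here is a minimal sketch of the recall formula (counts taken from that example):

# Disease example from above: 50 patients are actually sick, the model detects 45 of them
tp = 45          # sick patients correctly detected
fn = 50 - 45     # sick patients the model missed

recall = tp / (tp + fn)
print("Recall:", recall)  # 0.9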

F1 Score

F1 Score balances Precision and Recall; it is their harmonic mean.

Formula:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Simple Meaning:

One number that considers both false positives and false negatives.

When to Use F1 Score?

  • When data is imbalanced

  • When both Precision & Recall are important
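
Using the example values from the earlier sections (precision = 0.75 from the spam example, recall = 0.90 from the disease example) purely for illustration, here is a minimal sketch of the F1 formula:

precision = 0.75   # from the spam example above
recall = 0.90      # from the disease example above

f1 = 2 * (precision * recall) / (precision + recall)
print("F1 Score:", round(f1, 3))  # about 0.818

Note how F1 sits between precision and recall but is pulled toward the lower of the two, which is why it penalizes models that are strong on one metric and weak on the other.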

Accuracy vs Precision vs Recall vs F1

Metric       What It Focuses On              Best Used When
Accuracy     Overall correctness             Balanced data
Precision    Avoid false positives           Spam, fraud alerts
Recall       Avoid false negatives           Disease detection
F1 Score     Balance of precision & recall   Imbalanced data

Real-Life Example

Disease Test

  • Accuracy → Overall test correctness

  • Precision → If test says sick, is patient really sick?

  • Recall → Did we catch all sick patients?

  • F1 Score → Balance between false alarms and missed cases

In healthcare → Recall is usually the most important.
In spam detection → Precision is usually the most important.

Python Example (Using Scikit-learn)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Actual labels vs. model predictions (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))