Model Evaluation Metrics: Accuracy, Precision, Recall & F1 Score Explained
After building a Machine Learning model, the most important question is:
“How good is my model?”
Model Evaluation Metrics help us measure a model's performance and decide whether it is reliable.
This article explains the four most important classification metrics:
Accuracy
Precision
Recall
F1 Score
Why Do We Need Evaluation Metrics?
A model can look good but still be wrong or misleading.
Evaluation metrics help to:
Measure correctness
Compare models
Avoid wrong business decisions
Improve model performance
First, Understand This: The Confusion Matrix
All these metrics are based on the Confusion Matrix.
Example: Disease Prediction (Yes/No)
| Actual \ Predicted | Yes | No |
|---|---|---|
| Yes | TP | FN |
| No | FP | TN |
What do these mean?
TP (True Positive) → Correctly predicted Yes
TN (True Negative) → Correctly predicted No
FP (False Positive) → Predicted Yes but actually No
FN (False Negative) → Predicted No but actually Yes
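As a quick check, scikit-learn can build this matrix for you. Below is a minimal sketch with made-up labels (1 = Yes, 0 = No); the labels=[1, 0] argument simply orders the rows and columns to match the table above.
from sklearn.metrics import confusion_matrix
y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # actual labels (1 = Yes, 0 = No)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions
# Rows = actual, columns = predicted; labels=[1, 0] gives the layout [[TP, FN], [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))  # here: TP = 3, FN = 1, FP = 1, TN = 3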
Accuracy
Accuracy tells us how many predictions were correct overall.
Formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Simple Meaning:
Out of all predictions, how many were correct?
Example:
Total predictions = 100
Correct predictions = 90
Accuracy = 90%
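Plugging in hypothetical counts that match this example (say TP = 50, TN = 40, FP = 6, FN = 4, i.e. 90 correct out of 100) gives the same result:
tp, tn, fp, fn = 50, 40, 6, 4  # hypothetical counts: 90 correct predictions out of 100
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.9, i.e. 90%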
When Is Accuracy Good?
When data is balanced
When False Positives and False Negatives have equal importance
When Is Accuracy NOT Enough?
When data is imbalanced (e.g., fraud detection)
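Here is a small sketch of why, using made-up fraud data with 1 fraudulent transaction in 100: a model that always predicts "not fraud" still reaches 99% accuracy while catching no fraud at all.
from sklearn.metrics import accuracy_score, recall_score
y_true = [1] + [0] * 99   # 1 fraudulent transaction, 99 legitimate ones
y_pred = [0] * 100        # a lazy model that always predicts "not fraud"
print(accuracy_score(y_true, y_pred))  # 0.99 -> looks great
print(recall_score(y_true, y_pred))    # 0.0  -> every fraud case was missed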
Precision
Precision tells us how many predicted positives were actually correct.
Formula:
Precision = TP / (TP + FP)
Simple Meaning:
When the model says YES, how often is it correct?
Example (Spam Email):
Emails marked as spam = 20
Actually spam = 15
Precision = 15 / 20 = 75%
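The same spam example written out with the formula, using the counts from above:
tp = 15   # emails marked as spam that really are spam
fp = 5    # emails marked as spam that are not spam (20 marked - 15 truly spam)
precision = tp / (tp + fp)
print(precision)  # 0.75, i.e. 75%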
When Is Precision Important?
Spam detection
Fraud alerts
Legal cases
You want fewer false alarms.
Recall (Sensitivity)
Recall tells us how many actual positives were correctly detected.
Formula:
Recall = TP / (TP + FN)
Simple Meaning:
Out of all real YES cases, how many did the model catch?
Example (Disease Detection):
Actual sick patients = 50
Correctly detected = 45
Recall = 90%
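The same disease example written out with the formula, using the counts from above:
tp = 45   # sick patients the model correctly detected
fn = 5    # sick patients the model missed (50 actual - 45 detected)
recall = tp / (tp + fn)
print(recall)  # 0.9, i.e. 90%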
When Is Recall Important?
Disease detection
Fraud detection
Safety systems
You don’t want to miss real cases.
F1 Score
F1 Score is the harmonic mean of Precision and Recall, a single number that balances the two.
Formula:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Simple Meaning:
One number that considers both false positives and false negatives.
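As an illustration only, combining the precision from the spam example (0.75) with the recall from the disease example (0.90):
precision, recall = 0.75, 0.90
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.818 -> sits between the two, pulled toward the lower value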
When to Use F1 Score?
When data is imbalanced
When both Precision & Recall are important
Accuracy vs Precision vs Recall vs F1
| Metric | What It Focuses On | Best Used When |
|---|---|---|
| Accuracy | Overall correctness | Balanced data |
| Precision | Avoid false positives | Spam, fraud alerts |
| Recall | Avoid false negatives | Disease detection |
| F1 Score | Balance of precision & recall | Imbalanced data |
Real-Life Example
Disease Test
Accuracy → Overall test correctness
Precision → If test says sick, is patient really sick?
Recall → Did we catch all sick patients?
F1 Score → Balance between false alarms and missed cases
In healthcare → Recall is most important
In spam detection → Precision is most important
Python Example (Using Scikit-learn)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Ground-truth labels and the model's predictions (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1]
# For these labels: TP = 3, TN = 2, FP = 1, FN = 1
print("Accuracy:", accuracy_score(y_true, y_pred))    # (3 + 2) / 7 ≈ 0.71
print("Precision:", precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print("Recall:", recall_score(y_true, y_pred))        # 3 / (3 + 1) = 0.75
print("F1 Score:", f1_score(y_true, y_pred))          # 0.75