Classification metrics help us evaluate how well a model performs on classification tasks such as detecting spam emails or diagnosing diseases.
By the end, you'll understand:
- Confusion Matrix and its interpretation.
- Accuracy, Precision, and Recall.
- F1-Score.
- How to compute these metrics using Python and scikit-learn.
Let's dive in!
A Confusion Matrix describes the performance of a classification model. In the context of a confusion matrix, a positive prediction means predicting the class labeled 1, and a negative prediction means predicting the class labeled 0. For binary classification, the confusion matrix is a 2x2 table that shows:
- True Positives (TP): The number of correct positive predictions.
- True Negatives (TN): The number of correct negative predictions.
- False Positives (FP): The number of incorrect positive predictions.
- False Negatives (FN): The number of incorrect negative predictions.
Imagine you need to classify emails as spam (1) or not spam (0). Let's define some example predictions and build our confusion matrix:
```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Sample classification dataset
y_true = np.array([0, 1, 0, 1, 0, 1, 1, 0, 1, 0])  # True labels
y_pred = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 0])  # Predicted labels

# Calculating confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)
print(f"Confusion Matrix:\n{conf_matrix}")
```
Output:
```
Confusion Matrix:
[[3 2]
 [1 4]]
```
This tells us:
- True Positives (TP): 4 (the model correctly predicted spam four times)
- True Negatives (TN): 3 (the model correctly predicted not spam three times)
- False Positives (FP): 2 (the model incorrectly predicted spam two times)
- False Negatives (FN): 1 (the model incorrectly predicted not spam one time)
Note that the values in the confusion matrix are stored this way:
```
[[TN FP]
 [FN TP]]
```
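As a quick check, you can unpack these four counts directly from the matrix with `ravel()`. This minimal sketch reuses the `y_true` and `y_pred` arrays defined above:

```python
from sklearn.metrics import confusion_matrix

# For binary labels, ravel() flattens the matrix in (TN, FP, FN, TP) order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=2, FN=1, TP=4
```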
Accuracy is the ratio of correctly predicted instances out of all instances. It's useful but can be misleading for imbalanced datasets.
Formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Let's compute accuracy using scikit-learn:
```python
from sklearn.metrics import accuracy_score

# Calculating accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy}")
```
Output:
```
Accuracy: 0.7
```
Our model is 70% accurate. But sometimes accuracy alone isn't enough. Accuracy can be deceptive in imbalanced datasets where one class significantly outnumbers the other. For example, if 95% of emails are not spam and only 5% are spam, a model that never classifies any email as spam would still be 95% accurate. Thus, accuracy doesn't always reflect the real performance on minority classes.
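To see this pitfall concretely, here's a minimal sketch with a made-up imbalanced dataset (the `y_true_imb` and `y_pred_imb` arrays below are purely illustrative, not part of our email example):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 95 not-spam emails (0) and 5 spam emails (1)
y_true_imb = np.array([0] * 95 + [1] * 5)

# A "model" that never predicts spam at all
y_pred_imb = np.zeros(100, dtype=int)

# High accuracy, even though every spam email is missed
print(f"Accuracy: {accuracy_score(y_true_imb, y_pred_imb)}")  # 0.95
```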
For imbalanced data like this, we need other metrics. Let's look at two of them: Precision and Recall.
Precision is the ratio of correctly predicted positive cases out of all predicted positives. It's crucial when false positives are costly (e.g., spam detection).
Formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$
Here's how to calculate precision:
```python
from sklearn.metrics import precision_score

# Calculating precision
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.2f}")
Output:
```
Precision: 0.67
```
67% of instances predicted as spam were actually spam.
Use precision when the cost of false positives is high. This metric is crucial in scenarios where the consequences of incorrectly predicting a positive are significant. Example: In spam detection, marking an important email as spam (a false positive) can result in the user missing critical information. Therefore, we prioritize obtaining a high precision to minimize false positives.
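To tie this back to the confusion matrix, here's a tiny sketch that reproduces the same value by hand from the TP and FP counts we found above:

```python
# Precision computed by hand from the confusion matrix counts (TP = 4, FP = 2)
tp, fp = 4, 2
manual_precision = tp / (tp + fp)
print(f"Manual precision: {manual_precision:.2f}")  # 0.67, matching precision_score
```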
Recall is the ratio of correctly predicted positive cases out of all actual positives. It's essential when false negatives are costly (e.g., disease detection).
Formula:

$$\text{Recall} = \frac{TP}{TP + FN}$$
Let's compute recall:
```python
from sklearn.metrics import recall_score

# Calculating recall
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall}")
```
Output:
```
Recall: 0.8
```
80% of actual spam emails were correctly predicted as spam.
Use recall when the cost of false negatives is high. This metric is essential in situations where missing actual positive cases is more detrimental than having false positives. Example: In disease diagnosis, failing to identify a disease (a false negative) can have severe consequences on patient health. In such cases, we aim for high recall to ensure as many actual positive cases as possible are correctly identified.
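As with precision, we can sanity-check the recall value against the confusion matrix counts from above:

```python
# Recall computed by hand from the confusion matrix counts (TP = 4, FN = 1)
tp, fn = 4, 1
manual_recall = tp / (tp + fn)
print(f"Manual recall: {manual_recall}")  # 0.8, matching recall_score
```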
Sometimes you want to pay attention to both Precision and Recall, finding an optimal balance between them. In such cases, we use the F1-Score metric.
F1-Score is the harmonic mean of Precision and Recall. It balances the two metrics to provide a single measure of a model's performance.
Formula:

$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F1-Score is high only if both Precision and Recall are high. It's particularly useful for imbalanced datasets where a high score for one metric might be misleading without considering the other.
Here's how to calculate the F1-Score:
```python
from sklearn.metrics import f1_score

# Calculating F1-Score
f1 = f1_score(y_true, y_pred)
print(f"F1-Score: {f1:.2f}")
```
Output:
```
F1-Score: 0.73
```
An F1-Score of 0.73 indicates a good balance between Precision and Recall, offering a more comprehensive measure of the model's performance in scenarios where both false positives and false negatives are important.
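If you'd like to verify the formula yourself, this short sketch recomputes the F1-Score from the precision and recall values we obtained earlier:

```python
# Reusing the precision (~0.67) and recall (0.8) values computed above
f1_manual = 2 * (precision * recall) / (precision + recall)
print(f"Manual F1-Score: {f1_manual:.2f}")  # 0.73, matching f1_score
```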
We've covered:
- Confusion Matrix: Breakdown of predictions.
- Accuracy: Ratio of correct predictions.
- Precision: Correct predictions out of all positive predictions.
- Recall: Correct predictions out of all actual positives.
- F1-Score: Combination of Precision and Recall.
- The pitfalls of using Accuracy with imbalanced datasets.
These metrics help evaluate different aspects of your model's performance.
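As an optional extra (not something we used above), scikit-learn's `classification_report` prints precision, recall, and F1-Score for every class in a single table:

```python
from sklearn.metrics import classification_report

# Precision, recall, and F1-Score for both classes in one summary table
print(classification_report(y_true, y_pred, target_names=["not spam", "spam"]))
```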
Now it's your turn! You'll compute classification metrics on new datasets, reinforcing your understanding. Ready to practice? Let's go!