Python Confusion Matrix

Understanding Accuracy, Recall, Precision, F1 Scores, and Confusion Matrices

This article also includes ways to display your confusion matrix

Introduction 

Accuracy, Recall, Precision, and F1 Scores are metrics that are used to evaluate the performance of a model. Although the terms might sound complex, their underlying concepts are pretty straightforward. They are based on simple formulae and can be easily calculated.

This article will cover the following for each term:

  • Explanation
  • Why it is relevant
  • Formula
  • Calculating it without sklearn 
  • Using sklearn to calculate it 

At the end of the tutorial, we will go over confusion matrices and how to present them. I have provided a link to the Google Colab notebook at the end of the article.


Data 📈

Let’s assume we are classifying whether an email is spam or not

We will have two arrays: the first stores the actual values, while the second stores the predicted values. These predicted values are obtained from a classifier model. The type of model is not important; we are only interested in the predictions it made.

# Actual Value
labels = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
# Predicted Value
predictions = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

0 – email is NOT spam (negative)

1 – email IS spam (positive)


Key Terms 🔑

True Positive ➕ ➕

This case occurs when the label is positive and our predicted value is positive as well. In our scenario, when the email is spam and our model classified it as spam as well.

Condition for True Positive
TP = 0
for i in range(0,len(labels)):
    if labels[i] == predictions[i] and labels[i] == 1:
       TP+=1
print("True Positive: ", TP) # 3

False Positive ➖ ➕

This case occurs when the label is negative but our model’s prediction is positive. In our scenario, when the email is not spam but our model classifies it as spam.

Condition for False Positive
FP = 0
for i in range(0,len(labels)):
    if labels[i] == 0 and predictions[i] == 1:
       FP+=1
print("False Positive: ", FP) # 3

True Negative ➖ ➖

This is similar to True Positive, the only difference being the label and predicted value are both negative. In our scenario, when the email is not spam and our model classifies it as not spam as well.

Condition for True Negative
TN = 0
for i in range(0,len(labels)):
    if labels[i] == predictions[i] and labels[i] == 0:
       TN+=1
print("True Negative: ", TN) # 0

False Negative ➕ ➖

This case occurs when the label is positive but the predicted value is negative. In a way, opposite of False Positive. In our scenario, when the email is spam but our model classifies it as not spam.

Condition for False Negative
FN = 0
for i in range(0,len(labels)):
    if labels[i] == 1 and predictions[i] == 0:
       FN+=1
print("False Negative: ", FN) # 4

Correct Prediction 💯

The only condition for this case is that the label and the prediction value are the same. In our case, when the model classifies a spam email as spam and a non-spam email as non-spam.

Condition for correct prediction

The number of correct predictions can also be calculated as the sum of True Positives and True Negatives.

Calculating Correct Predictions
CP = 0
for i in range(0,len(labels)):
    if labels[i] == predictions[i]:
       CP+=1
print("Correct Prediction: ", CP) # 3
print(CP == TP + TN) # True

Incorrect Prediction ❎

The condition for this case is that the label and the prediction value must not be equal. In our scenario, an incorrect prediction is when our model classifies a spam email as not spam and a non-spam email as spam.

Condition for Incorrect Prediction

The number of incorrect predictions can also be calculated as the sum of False Positives and False Negatives.

Calculating Incorrect Predictions
ICP = 0
for i in range(0,len(labels)):
    if labels[i] != predictions[i]:
       ICP+=1
print("Incorrect Prediction: ", ICP)# 7
print(ICP == FP + FN) # True

Accuracy 🎯

Accuracy is the ratio of correct predictions to the total number of predictions. It is one of the simplest measures of a model. We must aim for high accuracy for our model. If a model has high accuracy, we can infer that the model makes correct predictions most of the time.

Accuracy Formula

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Without Sklearn

accuracy = (TP + TN)/(TP + FP + TN + FN)
print(accuracy*100) # 30.0

With Sklearn

from sklearn.metrics import accuracy_score
print(accuracy_score(labels , predictions)*100)

Recall 📲

A case when Accuracy can be misleading

High accuracy can sometimes be misleading. Consider the scenario below.

labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
predictions = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(accuracy_score(labels , predictions)*100) # 80

A spam email is rare compared to a non-spam email. As a result, the number of occurrences with label = 0 is higher than that of label = 1. In the above code, our labels array has 8 non-spam emails and 2 spam emails. If our model is built in a way that it always classifies an email as non-spam, it will achieve an accuracy of 80%. This is highly misleading since our model is basically unable to detect spam emails.

Calculating Recall Score

Recall is the ratio of correctly predicted positives (True Positives) to the total number of actual positive labels.

Formula for Recall

Recall = TP / (TP + FN)

In our above case, our model will have a recall of 0 since it had 0 True Positives. This tells us that our model is not performing well on spam emails and we need to improve it.
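
As a quick sanity check, below is a minimal sketch that runs sklearn's recall_score on the misleading all-negative scenario above (the variable names labels_skewed and preds_all_negative are just illustrative):

Recall on the all-negative predictions
from sklearn.metrics import recall_score

# Skewed data: 8 non-spam emails and 2 spam emails
labels_skewed = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
# A model that always predicts "not spam"
preds_all_negative = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(recall_score(labels_skewed, preds_all_negative)*100) # 0.0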

Without Sklearn

recall = (TP)/(TP+FN)
print(recall*100)

With Sklearn

from sklearn.metrics import recall_score

print(recall_score(labels,predictions))

Precision 🐾

A Case when Recall Score can be misleading

A high recall can also be highly misleading. Consider the case where our model is tuned to always return a positive prediction, essentially classifying every email as spam.

labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
predictions = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(accuracy_score(labels , predictions)*100)
print(recall_score(labels , predictions)*100)

Although the above case would have low accuracy (20%), it would have a high recall score (100%). 

Calculating Precision

Precision is the ratio of the correct positive predictions to the total number of positive predictions

Formula for Precision

Precision = TP / (TP + FP)

In the above case, the precision would be low (20%) since the model predicted a total of 10 positives, out of which only 2 were correct. This tells us that, although our recall is high and our model performs well on positive cases, i.e., spam emails, it performs badly on non-spam emails.

The reason our accuracy and precision are equal is that the model predicts all positives. In the real world, a model would correctly predict some of the negative cases, leading to higher accuracy. However, the precision would remain unchanged since it only depends on the correct positive predictions and the total number of positive predictions.
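
To verify this, below is a minimal sketch that runs sklearn's precision_score on the always-positive scenario above (again, the variable names are just illustrative):

Precision on the all-positive predictions
from sklearn.metrics import precision_score

labels_skewed = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
# A model that always predicts "spam"
preds_all_positive = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(precision_score(labels_skewed, preds_all_positive)*100) # 20.0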

Without Sklearn

precision = TP/(TP+FP)
print(precision)

With Sklearn

from sklearn.metrics import precision_score
print(precision_score(labels,predictions)*100)

F1 Score 🚗

The F1 score depends on both Recall and Precision; it is the harmonic mean of the two values.

Formula for F1 Score

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

We use the harmonic mean rather than the arithmetic mean because we want a low Recall or Precision to produce a low F1 score. In our previous case, where we had a recall of 100% and a precision of 20%, the arithmetic mean would be 60% while the harmonic mean would be 33.33%. The harmonic mean is lower and makes more sense, since we know the model is pretty bad.

AM = (1 + 0.2)/2
HM = 2*(1*0.2)/(1+0.2)
print(AM) # 0.6
print(HM) # 0.333
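
We can also confirm that sklearn's f1_score agrees with the harmonic mean for the always-positive scenario (a minimal sketch, with illustrative variable names):

F1 score on the all-positive predictions
from sklearn.metrics import f1_score

labels_skewed = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
preds_all_positive = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(f1_score(labels_skewed, preds_all_positive)) # 0.333...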

Without Sklearn

f1 = 2*(precision * recall)/(precision + recall)
print(f1)

With Sklearn

from sklearn.metrics import f1_score
print(f1_score(labels, predictions))

Confusion Matrix ❓

A confusion matrix is a table that summarizes the number of True Positives, False Positives, True Negatives, and False Negatives.

Assume we are working with the following data

# Actual Value
labels = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
# Predicted Value
predictions = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]
Example of Confusion Matrix

Calculating Confusion Matrix using sklearn

from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(labels, predictions)
# sklearn lays out the matrix as [[TN, FP], [FN, TP]]
FN = confusion[1][0]
TN = confusion[0][0]
TP = confusion[1][1]
FP = confusion[0][1]

You can also pass the normalize parameter to normalize the calculated counts, as shown in the sketch below.
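
For example, passing normalize='true' divides each row by the number of actual samples in that class (a minimal sketch, assuming scikit-learn 0.22 or later, where the normalize parameter was introduced):

Normalized Confusion Matrix
print(confusion_matrix(labels, predictions, normalize='true'))
# [[0.333 0.667]
#  [0.571 0.429]]  (values rounded)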

Displaying Confusion Matrix as Bar Graph

import matplotlib.pyplot as plt

plt.bar(['False Negative', 'True Negative', 'True Positive', 'False Positive'], [FN, TN, TP, FP])
plt.show()
Confusion Matrix as Bar Graph

Displaying Confusion Matrix as Heatmap

import seaborn as sns
sns.heatmap(confusion, annot=True, xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.ylabel("Label")
plt.xlabel("Predicted")
plt.show()
Confusion Matrix as HeatMap

Displaying Confusion Matrix using Pandas

import pandas as pd
data = {'Labels' : labels, 'Predictions': predictions}
df = pd.DataFrame(data, columns=['Labels','Predictions'])
confusion_matrix = pd.crosstab(df['Labels'], df['Predictions'], rownames=['Labels'], colnames=['Predictions'])
print(confusion_matrix)
Confusion Matrix using Pandas

Using Sklearn to generate Classification Report 👔

from sklearn.metrics import classification_report
print(classification_report(labels,predictions))

Below is the output, which reports the precision, recall, F1 score, and support for each class, along with the overall accuracy and the macro and weighted averages.

Conclusion

Accuracy alone cannot determine whether a model is good or bad, but accuracy combined with precision, recall, and F1 score can give a good idea of the model's performance.

Link to Google Colab

Check out my article on Bias-Variance TradeOff