
Top 10 evaluation metrics for classification models

Sam has built a classification model to predict whether a person will have a heart attack or not. Now he wants to evaluate the performance of his model. A few questions arise in Sam’s mind:

  1. Is the model I have built good enough to predict whether a person will have a heart attack or not?
  2. What are the various classification metrics that I can use on my dataset?
  3. Which metrics should we give importance to on an imbalanced dataset?
  4. How do I decide which metrics to choose for a given problem?

With the help of this blog, I will try to answer all the questions Sam asked. The blog has two main parts. The first part discusses the various classification metrics and when to use them. The second part implements these metrics in Python and R.

Confusion matrix

A confusion matrix is a table containing information about how the model performs on the test dataset.

With the help of the confusion matrix, you can calculate various metrics like precision, recall, F1 score, and accuracy.

In the tables below, the rows represent the actual values and the columns represent the predicted values.


Four terms are used in the confusion matrix:

  • True positive – A patient has heart disease, and the model correctly predicted that they do.
  • True negative – A patient does not have heart disease, and the model correctly predicted that they do not.
  • False positive – Also known as a Type I error. A patient does not have heart disease, but the model incorrectly predicted that they do.
  • False negative – Also known as a Type II error. A patient has heart disease, but the model incorrectly predicted that they do not.

For binary classification

                 Predicted Apple    Predicted Mango
Actual Apple     TP                 FN
Actual Mango     FP                 TN

Confusion matrix for binary classification
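
As a quick illustration, here is a minimal sketch of building a confusion matrix with scikit-learn’s confusion_matrix; the label vectors are made-up toy values just for demonstration.

from sklearn.metrics import confusion_matrix

# Toy ground-truth and predicted labels (1 = heart disease, 0 = healthy)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
# Note: scikit-learn lists the negative class first, so TN sits in the top-left cell.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]   -> TN FP
#  [1 3]]  -> FN TP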

For multiclass classification

For the Apple class

Confusion matrix for multiclass classification

For the Mango class

Confusion matrix for multiclass classification

Important points:

  • The diagonal of the matrix contains the true positive count for every class.
  • All the false negative values for a class lie in the same row as its true positive value.
  • All the false positive values for a class lie in the same column as its true positive value.
  • All the remaining cells count toward the true negatives of that class.
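
To make these points concrete, here is a small sketch (with made-up fruit labels) that derives each class’s TP, FP, FN, and TN counts from a multiclass confusion matrix using NumPy.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["apple", "mango", "banana", "apple", "mango", "banana", "apple"]
y_pred = ["apple", "mango", "mango",  "apple", "banana", "banana", "mango"]
labels = ["apple", "banana", "mango"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows = actual, columns = predicted

tp = np.diag(cm)                 # the diagonal holds the true positives of every class
fn = cm.sum(axis=1) - tp         # the rest of the row -> false negatives
fp = cm.sum(axis=0) - tp         # the rest of the column -> false positives
tn = cm.sum() - (tp + fn + fp)   # everything else -> true negatives

for i, label in enumerate(labels):
    print(label, "TP:", tp[i], "FP:", fp[i], "FN:", fn[i], "TN:", tn[i])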

True negative rate, also known as specificity

It is defined as the ratio of correctly predicted negative samples to the total number of actual negative samples. The mathematical formula to calculate the TNR is given down below.

Specificity (TNR) = TN / (TN + FP)

When to use Specificity

Specificity is helpful when you care about how well the model identifies the negative class, for example, making sure that patients who do not have heart disease are not flagged as positive.
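
scikit-learn has no dedicated specificity function, so a common approach is to unpack the binary confusion matrix and compute it directly. A minimal sketch with toy labels:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# ravel() flattens the 2x2 confusion matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print("Specificity:", specificity)  # 4 / (4 + 1) = 0.8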

Classification report

It summarizes the model’s performance, reporting precision, recall, F1 score, and support for each class, whether the problem is binary or multiclass.
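
In scikit-learn this summary is produced by classification_report; a minimal sketch with toy labels:

from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Prints per-class precision, recall, F1 score, and support
print(classification_report(y_true, y_pred, target_names=["healthy", "heart disease"]))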

Accuracy

It represents the percentage of correct predictions on the test dataset. The mathematical formula to calculate accuracy is given down below.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

An accuracy of 100% means that the model separates the classes perfectly.

But there are a few warnings regarding the use of accuracy.

  • If the dataset is imbalanced, accuracy is not a suitable metric for evaluating performance. Here, imbalanced means one class has far more samples than the other.

One real-life example is fraudulent transactions: we may have a million transaction records, of which less than 0.1% are fraudulent. So even if you get 99% accuracy, there is a good chance the result is misleading.

So, when to use accuracy?

Accuracy is used when the data is only slightly imbalanced, say around a 70:30 ratio. If you go beyond that, try oversampling and undersampling techniques, as shown by the pitfall below.
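
The sketch below illustrates this warning with made-up labels: a model that always predicts “not fraud” reaches 99% accuracy on a 99:1 imbalanced sample while catching zero fraud cases.

from sklearn.metrics import accuracy_score, recall_score

# 990 legitimate transactions (0) and 10 fraudulent ones (1)
y_true = [0] * 990 + [1] * 10
# A useless model that predicts "not fraud" for everything
y_pred = [0] * 1000

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.99, looks great
print("Recall  :", recall_score(y_true, y_pred))    # 0.0, catches no fraud at all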

Precision

It is defined as the ratio of correctly classified positive samples to all samples predicted as positive. The mathematical formula to calculate precision is given down below.

Precision = TP / (TP + FP)
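
A minimal sketch computing precision with scikit-learn’s precision_score (toy labels):

from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# TP = 3, FP = 1  ->  precision = 3 / 4 = 0.75
print(precision_score(y_true, y_pred))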

When should we use precision?

This metric is used when you care more about the quality of the positive predictions, that is, when false positives are costly.

For example, a model focuses on finding a sick patient rather than a healthy patient in the healthcare sector.

From the above example, we can clearly see that precision only looks at the positive class. To get a metric that accounts for the whole confusion matrix, the Matthews correlation coefficient comes into the picture.

Matthews correlation coefficient

This metric is also known as the phi coefficient, and its main job is to assess the quality of binary classification models. The mathematical equation of MCC is given down below.

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

The output values lie in the range from -1 to 1.

Here, -1 denotes that the model classifies everything the wrong way around, 0 denotes that the model is essentially making random predictions on the test dataset, and 1 denotes that the model classifies everything perfectly.

The main advantage of MCC is that it takes all four values of the confusion matrix into account, so it remains informative whether or not the dataset is balanced.
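
scikit-learn exposes this metric as matthews_corrcoef; a minimal sketch with toy labels:

from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Ranges from -1 (completely wrong) through 0 (random) to 1 (perfect)
print(matthews_corrcoef(y_true, y_pred))  # 0.5 for these toy labels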

Recall (True positive rate or Sensitivity)

It is defined as the ratio of correctly predicted positive samples to the total number of actual positive samples. The mathematical formula to calculate recall is given down below.

Recall (TPR) = TP / (TP + FN)

When should we use Recall?

Recall is used when we have to minimize the number of false negatives. For example, suppose we are trying to predict whether a person has heart disease or not, and we want to reduce the number of false negatives (i.e., patients who have heart disease but whom our model predicted as healthy). In that case, we can use recall.

Having a high recall on its own does not mean that the model is good. In the previous example, if the model predicted that every patient has heart disease, it would achieve 100% recall, but such a model is clearly useless.
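
A minimal sketch computing recall with scikit-learn’s recall_score (same toy labels as before):

from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# TP = 3, FN = 1  ->  recall = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))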

F1 score

The F1 score is also known as the F-measure, and its main job is to maintain a balance between precision and recall. It is calculated as the harmonic mean of precision and recall. The mathematical formula to calculate the F1 score is given down below.

F1 score = 2 * (Precision * Recall) / (Precision + Recall)

When to use the F1 score

The F1 score is used when we have an imbalanced dataset and want a single number that balances precision and recall.
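
A minimal sketch computing the F1 score with scikit-learn’s f1_score (toy labels):

from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = 0.75 and recall = 0.75, so F1 = 0.75 here
print(f1_score(y_true, y_pred))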

Log Loss

It is also known as binary cross-entropy. It measures the performance of a classifier that outputs a probability between 0 and 1 for the positive class in binary classification.

The mathematical formula to calculate the log loss is given down below.

Log loss = -(1/N) * Σ_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

In the above formula, p_i is the predicted probability for sample i, y_i is its actual label (0 or 1), and N is the number of samples. The goal is to minimize the log loss; lower values mean better, more confident correct predictions.

When to use log loss

It should be used when the model outputs probabilities, as the logistic regression algorithm does.
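
A minimal sketch of log loss with scikit-learn; note that it takes predicted probabilities rather than hard class labels (the probabilities below are made up):

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
# Predicted probability of the positive class for each sample (made-up values)
y_prob = [0.9, 0.1, 0.8, 0.35, 0.2]

# Lower is better; confident wrong predictions are penalized heavily
print(log_loss(y_true, y_prob))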

Categorical cross-entropy

This is an extended version of the log loss, which is used to measure the performance of the multiclass classification problem. The mathematical formula to calculate the categorical cross-entropy is given down below.

Categorical cross-entropy = -(1/N) * Σ_i Σ_c y_ic * log(ŷ_ic)

Here, y_ic is 1 if sample i actually belongs to class c (and 0 otherwise), and ŷ_ic is the predicted probability that sample i belongs to class c.

When to use categorical cross-entropy

This is used in the neural network when we have to evaluate multiclass classification problems.
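
A minimal NumPy sketch of the formula above for a three-class problem, using one-hot encoded actual labels and made-up predicted probabilities:

import numpy as np

# One-hot encoded actual labels for 3 samples and 3 classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

# Predicted class probabilities for the same samples (made-up values)
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])

# Average cross-entropy over all samples
cce = -np.mean(np.sum(y_true * np.log(y_hat), axis=1))
print(cce)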

AUC ROC score

It is used to measure and visualize the performance of a classification model. Here, ROC is the probability curve, and AUC measures how well the model separates the classes.

To plot a ROC curve, put the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis.

  • If the value of AUC is 1, then the model is 100% capable of distinguishing between the positive and the negative class.
  • If the value of AUC is 0.5, then the model cannot distinguish between two classes.
  • If the value of AUC is 0, the model is giving completely reversed results: it predicts the positive class as negative and the negative class as positive.
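
A minimal sketch computing the AUC and the points of the ROC curve with scikit-learn (made-up probabilities):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Predicted probability of the positive class (made-up values)
y_prob = [0.9, 0.3, 0.6, 0.8, 0.2, 0.55, 0.7, 0.1]

print("AUC:", roc_auc_score(y_true, y_prob))

# FPR goes on the x-axis and TPR on the y-axis when plotting the curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(fpr, tpr)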

Examples

You are given a binary classifier that identifies whether a detected object is a rock or a mine. Compute the performance of the classifier using five suitable classification metrics. The confusion matrix of the binary classifier is given down below.

                 Predicted Rock     Predicted Mine
Actual Rock      480 (TP)           2000 (FN)
Actual Mine      1760 (FP)          560 (TN)

  • Precision = 480 / (480 + 1760) = 480 / 2240 ≈ 0.21
  • Recall = 480 / (480 + 2000) = 480 / 2480 ≈ 0.19
  • F1 score = (2 * 0.21 * 0.19) / (0.21 + 0.19) ≈ 0.20
  • Accuracy = (480 + 560) / (480 + 2000 + 1760 + 560) = 1040 / 4800 ≈ 0.22
  • Specificity = 560 / (560 + 1760) ≈ 0.24
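
As a quick check, the same numbers can be reproduced by rebuilding the label vectors from the confusion-matrix counts and calling scikit-learn (treating “rock” as the positive class):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Rebuild label vectors from the confusion-matrix counts (1 = rock, 0 = mine)
y_true = np.array([1] * (480 + 2000) + [0] * (1760 + 560))
y_pred = np.array([1] * 480 + [0] * 2000 + [1] * 1760 + [0] * 560)

print(precision_score(y_true, y_pred))  # ~0.21
print(recall_score(y_true, y_pred))     # ~0.19
print(f1_score(y_true, y_pred))         # ~0.20
print(accuracy_score(y_true, y_pred))   # ~0.22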

Implementation of all these metrics in Python and R

Dataset description – The dataset contains information about patients who have heart disease and patients who do not.

You can download the dataset from this link https://www.kaggle.com/datasets/aiaiaidavid/cardio-data-dv13032020

Python implementation
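
Below is a minimal end-to-end sketch of computing the metrics discussed above with scikit-learn. The file name (cardiovascular_diseases_dv3.csv) and the target column name (CARDIO_DISEASE) are assumptions; adjust them to match the downloaded dataset, and swap in your own model if you prefer.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, log_loss,
                             roc_auc_score)

df = pd.read_csv("cardiovascular_diseases_dv3.csv")  # assumed file name
X = df.drop(columns=["CARDIO_DISEASE"])              # assumed target column
y = df["CARDIO_DISEASE"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("MCC      :", matthews_corrcoef(y_test, y_pred))
print("Log loss :", log_loss(y_test, y_prob))
print("AUC ROC  :", roc_auc_score(y_test, y_prob))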

R implementation

Conclusion

In this blog, we learned about the various evaluation metrics used to evaluate classification models, and we implemented them in the Python and R programming languages using their standard libraries.

If you like this blog, you can share it with your friends or colleagues. You can connect with me on social media profiles like LinkedIn, Twitter, and Instagram.

LinkedIn – https://www.linkedin.com/in/abhishek-kumar-singh-8a6326148

Twitter- https://twitter.com/Abhi007si

Instagram- www.instagram.com/dataspoof