In ML classification, the main concern is how accurately the model classifies. There are several quantities we need to measure. The true positive rate, true negative rate, false positive rate, false negative rate, confusion matrix, accuracy, precision, specificity, recall (sensitivity), F1 score, support, ROC curve, and AUC are among the most important measures. All of these will be discussed.
To see how the terminology used for logistic regression differs between statistics and machine learning, recall the list of the main quantities in statistics: the pseudo-R^2 value, the likelihood ratio test, Wald statistics, AIC, and BIC.
True Positive, TP: an item identified as positive by the classifier that is really positive
True Negative, TN: an item identified as negative by the classifier that is really negative
False Positive, FP: an item identified as positive by the classifier while it is really negative
False Negative, FN: an item identified as negative by the classifier while it is really positive
All these counts are displayed in a two-by-two matrix known as the confusion matrix; in each cell we put the number of items that are TP (top left), FN (top right), FP (bottom left), and TN (bottom right).
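As a minimal sketch of how the confusion matrix can be computed in practice (assuming scikit-learn is available; the y_true and y_pred arrays below are hypothetical), one may write:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and classifier predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# labels=[1, 0] orders the rows and columns so the matrix matches the layout above:
# [[TP, FN],
#  [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()
print(cm)
print("TP =", tp, " FN =", fn, " FP =", fp, " TN =", tn)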
We need to introduce the following performance measures (a short code sketch follows this list):
Accuracy: how often the classifier is correct
Precision: how many selected items are relevant
Specificity or selectivity: the proportion of actual negatives correctly identified
Recall or sensitivity: how many relevant items are selected
F1 score: the harmonic mean of precision and recall,
(2 × Precision × Sensitivity) / (Precision + Sensitivity)
Support: the number of occurrences of each label
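As a brief sketch of how these measures can be obtained together (again assuming scikit-learn, with the same hypothetical labels and predictions as before), the classification report prints precision, recall, F1 score, and support for each label:

from sklearn.metrics import classification_report

# Hypothetical labels and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(classification_report(y_true, y_pred, digits=3))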
The confusion matrix is very helpful for understanding the performance of the model on the validation (or test) set. However, there are other measures that can give further insight into the model's performance. We discuss each of them below.
Accuracy is simple: it says how often the classifier is correct. This number lies between 0 and 100 percent. With two classes of roughly equal size, any accuracy above 50 percent is worth considering, since it is better than random selection; with imbalanced classes, accuracy alone can be misleading.
Precision says how many of the selected items are relevant. In other words, it is the proportion of what we correctly identified as positive over everything we identified as positive (a subjective view), i.e., proportionally making fewer errors among our positive calls.
Recall, or sensitivity, says how many of the relevant items are selected. In other words, it is the proportion of what we correctly identified as positive over what is, in reality, positive (an objective view), i.e., proportionally better reflecting reality.
The F1 score is a measure that balances precision and sensitivity and is defined as the harmonic mean of precision and recall.
Support is the number of occurrences of each label.
We can express precision, specificity, and sensitivity in terms of the components of the confusion matrix: Precision = TP/(TP + FP), Specificity = TN/(TN + FP), and Sensitivity = TP/(TP + FN). As discussed earlier, the threshold of a probabilistic classifier plays an important role in defining the classifier, and other parameters can also have a direct or indirect impact on it. However, the decision-maker can have different concerns: he or she can be more concerned with the precision, the specificity, or the sensitivity of the classifier. For instance, if an insurance company wants to pay a claim, it needs to be more precise or specific, so it needs a classifier that classifies the claim with high precision or specificity. In contrast, in the credit card business, in order to avoid fraudulent transactions, the banker needs to be sensitive to any fraud report.
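To illustrate how the threshold shifts the balance between these concerns, here is a small sketch (the labels and probabilities are hypothetical): raising the threshold typically increases precision and specificity while lowering sensitivity, and vice versa.

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predicted positive-class probabilities
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
p_pos = np.array([0.9, 0.4, 0.2, 0.8, 0.3, 0.6, 0.7, 0.1, 0.55, 0.35])

for t in (0.3, 0.5, 0.7):
    y_pred = (p_pos >= t).astype(int)  # items above the threshold are labelled positive
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    specificity = tn / (tn + fp)
    print(f"threshold={t:.1f}  precision={precision_score(y_true, y_pred):.2f}  "
          f"sensitivity={recall_score(y_true, y_pred):.2f}  specificity={specificity:.2f}")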
There is always a trade-off between these measures; for example, it is generally not possible to make a classifier both highly precise and highly sensitive at the same time. It is up to the decision-maker to decide which measure to emphasize. If the decision-maker wants a balance between the measures, she or he can look at the F1 score. But that is not the only measure; we will discuss more measures later.
Another measure to assess a classifier's performance is the ROC curve. To construct it, we compute a point (x, y) for each classification threshold probability, where y is the true positive rate (i.e., sensitivity or recall, TPR = TP/(TP + FN)) and x is the false positive rate (FPR = FP/(FP + TN), also known as fall-out or probability of false alarm), which equals 1 - specificity, since the specificity is TNR = TN/(TN + FP).
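A minimal sketch of this construction, assuming scikit-learn and hypothetical predicted probabilities: roc_curve sweeps the threshold and returns the FPR and TPR values that form the curve, and auc computes the area under it.

import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and predicted positive-class probabilities
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
p_pos = np.array([0.9, 0.4, 0.2, 0.8, 0.3, 0.6, 0.7, 0.1, 0.55, 0.35])

fpr, tpr, thresholds = roc_curve(y_true, p_pos)  # x = FPR, y = TPR at each threshold
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc(fpr, tpr))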
As previously discussed, there is a trade-off between these two numbers. Any rational decision-maker would like to have maximum sensitivity and maximum specificity, i.e., (x, y) = (0, 1). It is important to observe that the TPR and FPR are ratios taken with respect to the actual numbers of positive and negative cases, not with respect to what the classifier identifies as positive or negative.
As discussed, the ideal case occurs when the highest specificity and sensitivity are both reached, i.e., (x, y) = (0, 1), so moving toward that corner is better. For that reason, if the ROC curve of one model contains the other (that is, it lies closer to (0, 1)), then that model is better. In this picture, the model with the black ROC curve is better than the one with the red curve, and the blue curve corresponds to the worst model.
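Since a curve that lies closer to (0, 1) encloses a larger area, the AUC gives a convenient single-number comparison between models. A small sketch with two hypothetical score vectors on the same labels:

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scores from two classifiers on the same validation labels
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
scores_a = np.array([0.9, 0.8, 0.2, 0.7, 0.3, 0.4, 0.6, 0.1, 0.75, 0.35])  # well separated classes
scores_b = np.array([0.6, 0.4, 0.5, 0.7, 0.45, 0.55, 0.5, 0.3, 0.65, 0.6])  # heavily overlapping classes

print("AUC, model A:", roc_auc_score(y_true, scores_a))
print("AUC, model B:", roc_auc_score(y_true, scores_b))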
Another way to explain the ROC curve (in the general case) is as follows. Any classifier gives rise to two distributions: the distribution of scores for the positive cases (black) and the distribution for the negative cases (green). As discussed earlier, a threshold must be chosen to specify the classifier, and at any position the threshold determines the TP, FP, FN, and TN values from the two distributions. The model on the left-hand side clearly separates the positive and negative items, while in the one on the right-hand side the distributions overlap heavily and the model is not suitable for splitting the labels. As one can see from their associated ROC curves, the curve on the left is pulled more towards (0, 1), which confirms that the left model is better.
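As a rough sketch of this two-distribution picture (the score distributions below are synthetic, not the ones in the figure), less overlap between the positive and negative score distributions corresponds to an ROC curve pulled closer to (0, 1):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic classifier scores: positive cases centred higher than negative ones
scores_pos = rng.normal(loc=0.7, scale=0.15, size=500)
scores_neg = rng.normal(loc=0.4, scale=0.15, size=500)

plt.hist(scores_pos, bins=30, alpha=0.5, label="positive cases")
plt.hist(scores_neg, bins=30, alpha=0.5, label="negative cases")
plt.axvline(0.55, linestyle="--", label="threshold")  # items above the threshold are labelled positive
plt.xlabel("classifier score")
plt.legend()
plt.show()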