How do various classification performance metrics compare?
Here we test the following metrics using randomly generated classifier scores (see the sketch after this list):
- Accuracy
- 11-point interpolated average precision
- F1 score
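As a rough illustration of how these three metrics can be computed, here is a minimal sketch assuming NumPy and scikit-learn. The random labels, scores, and decision threshold are placeholders rather than the exact set-up used to generate the plots, and `eleven_point_ap` is a hypothetical helper, not a library function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_recall_curve

rng = np.random.default_rng(0)

# Placeholder data: random binary labels with uninformative random scores.
y_true = rng.integers(0, 2, size=500)
y_score = rng.random(500)
y_pred = (y_score >= 0.5).astype(int)

def eleven_point_ap(y_true, y_score):
    """11-point interpolated AP: the mean, over recall levels 0.0, 0.1, ..., 1.0,
    of the best precision achieved at any recall >= that level."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # Drop the (recall=0, precision=1) end point scikit-learn appends for plotting.
    precision, recall = precision[:-1], recall[:-1]
    return np.mean([precision[recall >= r].max() for r in np.linspace(0, 1, 11)])

print("accuracy            :", accuracy_score(y_true, y_pred))
print("macro F1            :", f1_score(y_true, y_pred, average="macro"))
print("11-point interp. AP :", eleven_point_ap(y_true, y_score))
```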
First, we compare the metrics as the imbalance in the number of examples per class increases.
Second, we compare the metrics as the number of classes varies.
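A sketch of the first comparison, under the same assumptions: one class is held at 100 examples while the other shrinks, and predictions are drawn at random so neither class is favoured by the "classifier" (the class sizes and seed are illustrative only).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Illustrative imbalance sweep: class 0 stays at 100 examples while the
# number of class-1 examples shrinks; predictions are random.
for n_minority in (100, 50, 20, 10, 5):
    y_true = np.array([0] * 100 + [1] * n_minority)
    y_pred = rng.integers(0, 2, size=y_true.size)
    acc = accuracy_score(y_true, y_pred)
    mf1 = f1_score(y_true, y_pred, average="macro")
    print(f"minority size {n_minority:3d}: accuracy={acc:.3f}  macro F1={mf1:.3f}")
```

The second comparison follows the same pattern, drawing labels from an increasing number of classes instead of increasing the imbalance.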
There are two important things to notice from the plots above:
1. Accuracy (Acc) and macro-averaged F1 (mF1) are exactly the same when each class has the same number of examples, and diverge as the classes become imbalanced.
This demonstrates the importance of using the F1 score when working with datasets in which there can be dominating and rare classes.
For example, if a dataset had 100 examples of "foo1" and only 10 examples of "foo2",
and all 110 examples were predicted as "foo1", the accuracy would still be 91% whilst the macro F1 score would drop to 48%.
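That arithmetic can be checked directly; a small sketch assuming scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score

# 100 "foo1" and 10 "foo2" examples, with everything predicted as "foo1".
y_true = ["foo1"] * 100 + ["foo2"] * 10
y_pred = ["foo1"] * 110

# zero_division=0 just silences the warning for "foo2", which is never predicted.
print(accuracy_score(y_true, y_pred))                              # ~0.91
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.48
```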
2. The 11-point interpolated average precision generally overestimates the average precision.
For an explanation of why it overestimates, see:
https://sanchom.wordpress.com/2011/09/01/precision-recall/
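To see the effect numerically, here is a rough sketch (again assuming NumPy and scikit-learn) that compares the exact average precision with the 11-point interpolated version on random scores; the maximum taken over recall >= r is what introduces the upward bias.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)

# Random binary labels with uninformative random scores.
y_true = rng.integers(0, 2, size=1000)
y_score = rng.random(1000)

# Exact average precision: area under the precision-recall curve, no interpolation.
ap_exact = average_precision_score(y_true, y_score)

# 11-point interpolated AP: at each recall level r take the *maximum* precision
# achieved at any recall >= r, then average over r = 0.0, 0.1, ..., 1.0.
precision, recall, _ = precision_recall_curve(y_true, y_score)
precision, recall = precision[:-1], recall[:-1]  # drop the (recall=0, precision=1) end point
ap_11pt = np.mean([precision[recall >= r].max() for r in np.linspace(0, 1, 11)])

print(f"exact AP              : {ap_exact:.3f}")
print(f"11-point interpolated : {ap_11pt:.3f}")  # typically comes out higher
```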
How do your performance metrics compare?