How do various classification performance metrics compare?

Here we test the following metrics using randomly generated classifier scores:

- Accuracy

- 11-point interpolated average precision

- F1-score

First, we compare the metrics against an increasing imbalance in the number of examples per class.

Second, we compare them against a varying number of classes.
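The exact code behind the plots is not reproduced here, but the following is a minimal sketch of the kind of experiment being run, assuming scikit-learn, binary labels, and a simple 0.5 threshold on the random scores (all assumptions for illustration):

```python
# A minimal sketch of the imbalance experiment, assuming scikit-learn and a
# 0.5 threshold on random scores (details differ from the code behind the plots).
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

rng = np.random.default_rng(0)

for n_minority in (100, 50, 20, 10):
    # 100 examples of class 0 and a shrinking number of class 1 examples.
    y_true = np.array([0] * 100 + [1] * n_minority)
    scores = rng.random(len(y_true))          # random classifier scores
    y_pred = (scores >= 0.5).astype(int)      # hard predictions for Acc / F1

    acc = accuracy_score(y_true, y_pred)
    mf1 = f1_score(y_true, y_pred, average="macro")
    ap = average_precision_score(y_true, scores)
    print(f"minority size {n_minority:3d}: Acc={acc:.2f}  mF1={mf1:.2f}  AP={ap:.2f}")
```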

There are two important things to notice from the plots above:

1. Accuracy (Acc) and macro-averaged F1 (mF1) are exactly the same when every class has the same number of examples, and diverge as the classes become imbalanced.

This demonstrates the importance of using the F1 score when working with datasets that contain both dominant and rare classes.

For example, if a dataset had 100 examples of "foo1" and only 10 examples of "foo2",

and all 110 examples were predicted as "foo1", the accuracy would still be 91% whilst the macro F1 score would drop to 48% (a quick check is sketched below).
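Those numbers can be verified with any macro-averaged F1 implementation; assuming scikit-learn:

```python
# Verifying the "foo1"/"foo2" example: 100 + 10 examples, all predicted "foo1".
from sklearn.metrics import accuracy_score, f1_score

y_true = ["foo1"] * 100 + ["foo2"] * 10
y_pred = ["foo1"] * 110

print(accuracy_score(y_true, y_pred))             # 0.909... ~= 91%
# scikit-learn warns that precision for "foo2" is undefined (no predictions)
# and counts it as 0, so the macro average is (0.95 + 0.00) / 2.
print(f1_score(y_true, y_pred, average="macro"))  # 0.476... ~= 48%
```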

2. The 11-point interpolated average precision generally overestimates the average precision.

For an explanation of why it overestimates, see:

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html#fig:precision-recall

https://sanchom.wordpress.com/2011/09/01/precision-recall/
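In short, the interpolation takes, at each of the 11 recall levels 0.0, 0.1, ..., 1.0, the maximum precision achieved at any recall at or above that level, which can only push the curve up. A rough sketch of the comparison, assuming scikit-learn for the precision-recall curve and the non-interpolated average precision:

```python
# A minimal sketch (assumptions of my own, not the original post's code)
# contrasting average precision with its 11-point interpolated variant.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # random binary labels
scores = rng.random(1000)                # random classifier scores

precision, recall, _ = precision_recall_curve(y_true, scores)

# 11-point interpolation: at each recall level r, take the maximum precision
# achieved at any recall >= r, then average the 11 values.
interpolated = [precision[recall >= r].max() for r in np.linspace(0.0, 1.0, 11)]
ap_11pt = np.mean(interpolated)
ap = average_precision_score(y_true, scores)

print(f"AP = {ap:.3f}, 11-point interpolated AP = {ap_11pt:.3f}")
```

On random scores like these, the 11-point figure typically comes out above the non-interpolated average precision, which is the overestimation the plots show.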

How do your performance metrics compare?