*If it’s easy, it’s probably wrong.*

If you’re fresh out of a data science course, or have simply been trying to pick up the basics on your own, you’ve probably attacked a few data problems. You’ve experimented with different classifiers, different feature sets, maybe different parameter sets, and so on. Eventually you’re going to think about evaluation: how well does your classifier work?

I’m assuming you already know basic evaluation methodology: you know not to test on the training set, maybe you know cross-validation, and maybe you know how to draw confidence bars or even do t-tests. That’s data science 101, and you’re probably well familiar with it by now. I’m not talking about that. What I’m talking about is something more basic, but even harder: *how should you measure the performance of your classifier?*

The obvious answer is to use **accuracy**: the number of examples it classifies correctly. You have a classifier that takes test examples and hypothesizes classes for each. On every test example, its guess is either right or wrong. You simply measure the number of correct decisions your classifier makes, divide by the total number of test examples, and the result is the accuracy of your classifier. It’s that simple. The vast majority of research results report accuracy, and many practical projects do too. It’s the default metric.

What’s wrong with this approach? Accuracy is a simplistic measure that is misleading on many real-world problems. In fact, the best way to get a painful “*But it worked in the lab, why doesn’t it work in the field?*” experience is to use accuracy as an evaluation metric.

### What’s wrong with accuracy?

To illustrate the problems, let’s start with a simple two-class domain. Every classifier for this domain sees examples from the two classes and outputs one of two possible judgments: Y or N. Given a test set and a specific classifier, you can place each decision as:

- a positive example classified as positive. This is a
**true positive**. - a positive example misclassified as negative. This is a
**false negative**. - a negative example classified as negative. This is a
**true negative**. - a negative example misclassified as positive. This is a
**false positive**.

We can create a 2×2 matrix with the columns as true classes and the rows as the hypothesized classes. It looks like this:

Each instance contributes to the count in exactly one of the cells. This is the way to think about your classifier.

Accuracy is simply the sum of the diagonals divided by the total:

What’s wrong with this?

#### The problem with class imbalance

Accuracy simply treats all examples the same and reports a percentage of correct responses. Accuracy may be fine when you’re dealing with balanced (or approximately balanced) datasets. The further you get from 50/50, the more accuracy misleads. Consider a dataset with a 99:1 split of negatives to positives. Simply guessing the majority class yields a 99% accurate classifier!

**In the real world, imbalanced domains are the rule, not the exception!** Outside of academia, I have never dealt with a balanced dataset. The reason is fairly straightforward. In many cases, machine learning is used to process populations consisting of a large number of negative (uninteresting) examples and a small number of positive (interesting, alarm-worthy) examples.

Examples of such domains are fraud detection (1% fraud is a huge problem; 0.1% is closer), HIV screening (prevalence in the USA is ~0.4%), predicting disk drive failures (~1% per year), click-through advertising response rates (0.09%), factory production defect rates (0.1%) — the list goes on. Imbalance is everywhere.

How does this complicate evaluation? An example is shown in the diagram at left. This is a churn problem—we want to recognize which customers are likely to churn (cancel their contracts) soon.

The top half shows the training population. We’ve balanced the population so half is positive, half is negative. We’ve trained two classifiers, A and B. They each make the same percentage of errors, so in this particular dataset **their accuracies are identical**, as shown by the red areas in the top half.

The bottom half shows the same two classifiers in a testing environment, where the population is now 10% (will churn) to 90% (will not churn). This is far more realistic. Notice that A is now far worse because its errors are false positives.

There are two lessons from this example. One is that you should know more about your classifiers than what accuracy tells you. The other is that, regardless of what population you train on, you should always test on the**real** population—the one you care about.

#### The problem with error costs

There is a second problem. Consider the domain of predicting click-through advertising responses. The expected rate (the class prior) of click-throughs is about 0.09%. This means that a classifier that unconditionally says, “*No, the user won’t click*” has an accuracy of 99.91%. What’s wrong with this? Strictly speaking, nothing! That’s great performance—if accuracy was what you cared about.

The problem is that you care about some classes much more than others. In this case, you care far more about those 0.09% of people who click through than the 99.91% who don’t. To put it more precisely, you care about which error you make: showing an ad to a few people who won’t click on it (a False Positive in the matrix above) isn’t so bad, but failing to show an ad to people who *would click* on it (a False Negative) is far worse. Targeting the right people for the ad is, after all, the whole reason you’re building a classifier in the first place.

Different error costs are common in many domains. In medical diagnosis, the cost of diagnosing cancer when it doesn’t exist is different from missing a genuine case. In document classification, the cost of retrieving an irrelevant document is different from the cost of missing an interesting one, and so on.

#### What about public data sets?

If you’ve worked with databases from somewhere like the UCI Dataset Repository, you may have noticed that they don’t seem to have these issues. The classes are usually fairly balanced, and when error cost information is provided, it seems to be exact and unproblematic. While these datasets may have originated from real problems, in many cases they have been cleaned up, artificially balanced, and provided with precise error costs. Kaggle competitions have similar issues—contest organizers must be able to judge submissions and declare a clear winner, so the evaluation criteria are carefully laid out in advance.

### Other evaluation metrics

By this point, you’ve hopefully become convinced that plain classification accuracy is a poor metric for real-world domains. What should you use instead? Many other evaluation metrics have been developed. It is important to remember that **each is simply a different way of summarizing the confusion matrix**.

Here are some metrics you’ll likely come across:

**true positive rate** = TP/(TP+FN) = 1 − false negative rate**false positive rate** = FP/(FP+TN) = 1 − true negative rate**sensitivity** = true positive rate**specificity** = true negative rate**positive predictive value** = TP/(TP+FP)**recall** = TP / (TP+FN) = true positive rate**precision** = TP / (TP+FP)**F-score** is the harmonic mean of precision and recall:

**G-score** is the geometric mean of precision and recall:

Two other measures are worth mentioning, though their exact definitions are too elaborate to explain here. Cohen’s Kappa is a measure of agreement that takes into account how much agreement could be expected by chance. It does this by taking into account the class imbalance and the classifier’s tendency to vote Yes or No. The Mann-Whitney U test is a general measure of separation ability of the classifier. It measures how well the classifier can rank positive instances above negative instances.

### Expected cost

Given all of these numbers, which is the right one to measure? We’ll avoid the customary Statistics answer (“*It depends*”), and answer: none of the above.

The confusion matrix contains frequencies of the four different outcomes. The most precise way to solve the problem is to use these four numbers to calculate expected cost (or equivalently, *expected benefit*). You may recall that the standard form of an expected value is to take the probability of each situation and multiply it by its value. We convert the numbers in the matrix above, then we have to the costs and benefits for each of the four. The final calculation is the sum:

*Expected cost*
= p(true positive) × "benefit" (true positive)
+ p(false positive) × "cost" (false positive)
+ p(true negative) × "benefit" (true negative)
+ p(false negative) × "cost" (false negative)

This is much more precise than accuracy since it splits apart each of the four possibilities for every decision and multiplies it by the corresponding cost or benefit of that decision. Ideally, this is the way you should evaluate a classifier if you want to reduce its performance to a single number. When you have a set of classifiers and you want to choose the best one, this method gives you a way of precisely measuring how well each classifier will perform.

### Lingering questions

If you’ve read this far, you may be thinking: “*Wait, what happens if I don’t have error costs? What if they change? What if I want to see graphs of performance? Isn’t there anything I can get besides a simple point estimate? And if estimated cost is the simple answer to everything, why did scikit-learn implement dozens of different metrics?* ”

All good questions. The next blog post, Part 2, will answer them.

A previous blog post, The Basics of Classifier Evaluation, Part 1, made the point that classifiers shouldn’t use classification accuracy — that is, the portion of labels predicted correctly — as a performance metric. There are several good reasons for avoiding accuracy, having to do with class imbalance, error costs, and changing conditions.

The next part in this series was going to go on to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. This follows the approximate track of our book, Data Science for Business, in Chapters 5 and 6. However, there are several important points to be made first. These are usually not taught in data science courses, or they are taught in pieces. In software packages, if they’re present at all they’re buried in the middle of documentation. Here I present a sequence that shows the progression and inter-relation of the issues.

## Ranking is better than classifying (or: Don’t brain-damage your classifier)

Inexperienced data scientists tend to apply classifiers directly to get labels. For example, if you use the popular Scikit-learn Python package, you’ll call `predict()`

on each instance to get a predicted class. While you *do* ultimately want to predict the class of the instance, what you want from the classifier is a *score* or *rank*. Technically, this is the posterior probability estimate *p(C|***X**) where *C* is the class and **X** is an input vector.

Nearly every classifier — logistic regression, a neural net, a decision tree, a k-NN classifier, a support vector machine, etc. — can produce a score instead of (or in addition to) a class label.1

In any case, throw out the class label and take the score. Why? Several reasons. First, many classifiers implicitly assume they should use 0.50 as the threshold between a positive and negative classification. In some cases you simply don’t want that, such as when your training set has a different class proportion than the test set.

Second, if you can get scores, then you can draw an entire *curve* of the classifier’s performance. With a class label you get only a single point on a graph; a curve has much more information. We’ll see that in the next installment.

The final reason is *flexibility*. You want to be able to rank the instances by score first, and decide later what threshold you’re going to use. Here are some scenarios in which you’d want to do that:

- Say you have a team of six fraud analysts who can handle the average workload. Today two of them called in sick so you’re at 2/3 capacity. What do you do? Using the sorted list, you just work the top 2/3 of the cases you’d normally work. So if you could normally work 100 in a day, today you work 66. This is called a
*workforce constraint*. - You get a call from a VP who tells you they’re seeing a lot of customer dissatisfaction related to excessive fraud: they’re losing lots of money from fraud, and customers are getting annoyed as well. The VP allocates money for four new temp workers. You now have 10 people (1 2/3 the capacity you usually have). Again, you use the same sorted list but you can work your way much further down than usual — you can get into the “probably not fraud but let’s check anyway” zone.

With only class labels, you wouldn’t have this flexibility. Your classifier would be *brain-damaged* (the technical term) and could only provide the prediction labels that made sense when it was trained — not when it was used.These scores are commonly called *probability estimates* (hence the method name `proba()`

in Scikit-learn). I’ll call them *scores* here and reserve the term probability estimate for another step. The reason will become clear in the final section.

### Where do these raw scores come from?

You may wonder how classifiers can generate numeric scores if they’re trained only on labeled examples. The answer is that it depends on the classifier. Here are a few examples to illustrate.

Consider logistic regression, a simple classifier which places a line (technically, a hyperplane) between different classes of points. How would you decide the score of a new, unclassified point? Intuitively, since the boundary separates the two classes, points near the boundary are probably very uncertain. As a point gets further from the boundary, its score (estimate) should increase, until a point infinitely far from the boundary has an extreme estimate (0 or 1). In fact, the logistic equation is used directly to determine a point’s score, as the following figure illustrates:

Consider a very different kind of classifier: the decision tree. A decision tree recursively subdivides the instance space into finer and finer subregions, based on the region’s *entropy* (informally: the class purity of the region). The decision tree algorithm continues to subdivide each of the regions until it is all of one class, or until further subdivision cannot be justified statistically. So how is a new instance’s score assigned in this scheme? Given a new unclassified instance, you scan down the tree starting at the root node, checking the new instance against the node’s test and taking the appropriate path, until you reach a leaf node. The leaf node determines the classification by checking the classes of the training instances that reached that leaf. The majority determines the class. To get a score, you can calculate it from the instance class counts. If there are 7 positive instances and 1 negative instance in the region, you can guess that all instances in that leaf should have scores of *7/(7+1)* = 0.875.2

Consider a *k*-nearest neighbor classifier, with *k* set to 5. Given a new, unclassified instance, *k* of its nearest neighbors are retrieved. To assign a class to the new instance, some function like majority is applied to the five neighbors. To assign a score, simply divide the number of positive instances by the total and return the fraction (there are more sophisticated ways to do it, but that’s the idea). As above, Laplace correction may be used to smooth these out.

As a final example, consider the Naive Bayes classifier. This classifier is different from the others in that it models class probabilities — the posteriors — directly to make a classification. You might expect that Naive Bayes would thus be most accurate in its probability estimates. It isn’t, for various reasons. This leads to the next section.

## Why isn’t this regression?

You’ve probably been introduced to classification and regression with the explanation that classification is used to predict labels (classes) and regression is used for numbers. So you may wonder: if we’re interested in a numeric score now, why aren’t we using regression? The answer is that the label type determines the type of problem (and technique). This still *is* a classification problem — we’re still dealing with classes so the fundamental nature hasn’t changed — but now we want probabilities of class membership instead of just labels. Put another way, the *labels* of the examples are still discrete classes, so we’re still doing classification. If instead the examples had *numbers*, we should be using regression.

## Calibration: From scores to probabilities

The last section explained how scores are derived from classifiers. In most code I’ve seen, when you ask for a probability estimate you get one of these scores. They’re even called probability estimates in the documentation. But are they?

The answer is: “Well, they’re estimates, but maybe not very good ones.”

What does it mean for a probability estimate to be good? First, it must satisfy the basic constraint of being between 0 and 1. That’s easy to do simply by bounding it artificially. More importantly, it should satisfy a frequency expectation:

**If a classifier assigns a score of ***s* to a population of instances, then fractionally about *s* of that population should truly be positive.

The process of taking a classifier and creating a function that maps its scores into probability estimates is called *calibration*. If the scores correspond well with probability estimates, that classifier is said to be *well-calibrated*.

Let’s step back and summarize. There are two separate properties of a classifier. First, how well does it *rank* (how consistently does it score positive examples above negative examples)? Second, how well *calibrated* is it (how close are its scores to probabilities)? Your instinct may tell you that one determines the other, but they’re really separate: a classifier can be good at ranking but poorly calibrated, and vice versa.

A common way to show calibration of a classifier is to graph its scores against true probabilities. This is called a *Reliability graph*. Logically, if a classifier’s estimates are accurate, its score on an instance should equal the probability that the instance is positive. But an instance’s class isn’t a probability; it’s either positive or negative. So the instances are *pooled* together based on the scores assigned to them.

A well-calibrated classifier would have all points on a diagonal line from (0,0) on the left to (1,1) on the right — or a step function approximating this. A perfect classifier (one which scores all positives as 1 and negatives as 0) would have one clump at (0,0) and another at (1,1), with nothing in between.

Here are graphs of well-calibrated and poorly-calibrated classifiers.3 Within each legend, in parentheses, is the mean-squared error of the curve against the diagonal (so smaller is better).

For example, see the graph of Gaussian Naive Bayes before and after calibration. As you can see, the raw scores are very poorly calibrated. Here GNB vastly over-estimates the true probability of examples throughout the ranges of scores, as shown by the blue line being very far above the diagonal throughout. After calibration (see the green line labeled “Gaussian Naive Bayes + Isotonic”) the scores are far better estimates.

Decision trees tend to be well calibrated,4 as you can see in in the figure. The raw scores (blue line) are not too far from the diagonal. Calibration brings the line in a little closer in to the diagonal (see “Decision Tree + Isotonic” and “Decision Tree + Sigmoid”).

If you need a calibrated classifier, the first thing to do is to look at its reliability graph and see whether it’s already fairly well calibrated. If so, just use its raw scores and skip calibration. If not, keep reading.

So what is calibration and how do you do it? Calibration involves creating a function that maps scores onto probability estimates. There are several techniques for doing this. Two of the most common are *Platt Scaling* and *Isotonic Regression*. In spite of their imposing names, both are fairly simple.

Platt Scaling (called *Sigmoid* in these graphs) involves applying logistic regression to the scores. The score of each instance is provided as input and the output is the instance’s class (0 or 1). This is equivalent to fitting the scores against the logistic (sigmoid curve) equation. After running the regression, you have a function *f(x)=ax+b* for which *a* and *b* have been solved. You simply apply *f* to a score to get a probability estimate.

Isotonic Regression is a little harder to explain, but it basically involves merging adjacent scores together into bins such that the average probability in each bin is monotonically increasing. It’s a form of linear regression that performs piecewise regression. The result is a set of bins, each of which has a score range and a probability associated with it. So the first bin may have the range *[0, 0.17]* and probability 0.15, meaning that any instance with a score between 0 and 0.17 should be assigned a probability estimate of 0.15.

### TIP:

Both of these calibration methods are available in Scikit-learn under `sklearn.calibration`

. In general, Isotonic Regression (a non-parametric method) performs better than Platt Scaling. A good rule of thumb is to use Isotonic Regression unless you have few instances, in which case use Platt Scaling. Of course, it doesn’t hurt to try both.

As a final note on calibration: *you may not need to do it at all*. Often you just want to find a good score threshold for deciding positive and negative instances. You care about the threshold and the performance you’ll get from it, but you don’t really need to know the probability it corresponds to. The next installment will show examples of that.

## What’s next

The next installment in this series will introduce performance graphs such as ROC curves, profit curves, and lift curves. Using these curves, you can choose a performance point on the curve where you want to be. You look up the score that produced that point and use it as your threshold. This is an easy way to choose a threshold that avoids calibration entirely.

For further information

“*Putting Things in Order: On the Fundamental Role of Ranking in Classification and Probability Estimation*,” an Invited Talk by Peter Flach. There are actually three properties of a classifier rather than just the two discussed here. Flach’s excellent video lecture covers everything and includes examples and algorithms.

“*Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers*,” Bianca Zadrozny and Charles Elkan.

“*Obtaining Calibrated Probabilities from Boosting*,” Alexandru Niculescu-Mizil and Rich Caruana.

“*Class probability estimates are unreliable for imbalanced data (and how to fix them)*,” Byron C. Wallace and Issa J. Dahabreh. 12th IEEE International Conference on Data Mining, ICDM 2012.

Code for reliability curves was adapted from here.

1. With Scikit-learn, just invoke the **proba**() method of a classifier rather than **classify**(). With Weka, click on More options and select the Output predictions box, or use -p on the command line.

2. It’s common to perform some sort of Laplace correction to smooth out the estimates, to avoid getting extreme values when only a few examples are present. Smoothed, this score would be (7+1) / (7+1+2) = 0.80.

3. These graphs were all made using the Adult dataset from the UCI repository.

4. In fact, decision trees should be perfectly calibrated because of how they produce scores. In a decision tree, a leaf with p positives and n negatives produces a score of p/(p+n). This is exactly how Reliability graphs pool instances, so the pooled value x always equals the score y. So why aren’t the lines in the figure of Decision Trees before and after calibration perfectly diagonal? Simply because the training and testing sets were slightly different, as they always are.