Evaluation Metrics

For regression problems

Mean Squared Error (MSE) is probably the most common one.

Say y1, y2, ..., yn are the actual values and y1', y2', ..., yn' are the predicted values.

Then (y1-y1')^2 + (y2-y2')^2 + ... + (yn-yn')^2 is the total squared error, i.e. the variance between the actual and predicted values.

Mean Squared Error:  MSE = 1/n * [ (y1-y1')^2 + (y2-y2')^2 + ... + (yn-yn')^2 ]

Root Mean Squared Error:  RMSE = square_root( MSE )
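
A minimal sketch of these two formulas in plain Python (the function names mse/rmse are just illustrative, not a library API):

    import math

    def mse(y_actual, y_pred):
        # mean of the squared differences between actual and predicted values
        n = len(y_actual)
        return sum((y - yp) ** 2 for y, yp in zip(y_actual, y_pred)) / n

    def rmse(y_actual, y_pred):
        # square root of the mean squared error
        return math.sqrt(mse(y_actual, y_pred))

    y_actual = [3.0, 2.5, 4.0]
    y_pred   = [2.8, 2.7, 4.2]
    print(mse(y_actual, y_pred), rmse(y_actual, y_pred))   # 0.04 and 0.2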

R2, or R squared, is another important one.

     Firstly, calculate the mean of y:  y_mean = 1/n * (y1 + y2 + ... + yn)

     The total variance within the y dataset: var_total = (y1 - y_mean)^2 + (y2 - y_mean)^2 + ... + (yn - y_mean)^2

     The variance between actual and prediction is again: var_residual = (y1 - y1')^2 + (y2 - y2')^2 + ... + (yn - yn')^2

    R2 = 1 - var_residual / var_total

   What does it mean? If the prediction matches the actual values perfectly, then the variance between actual and prediction is 0, i.e. var_residual is 0, so R2 = 100%, which means all the variance within the y dataset has been explained perfectly by the model. However, if not all the variance within y has been explained, i.e. var_residual > 0, then R2 will be less than 1. Therefore, R2 reflects how well the model has explained the variance within the data y.

   If a model is really poor, i.e. even worse than the baseline of always predicting the mean, then var_residual is bigger than var_total and R2 is negative.
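
A minimal sketch of the R2 calculation following the steps above (the function name is illustrative, and var_total is assumed to be non-zero):

    def r2_score(y_actual, y_pred):
        # R2 = 1 - var_residual / var_total
        n = len(y_actual)
        y_mean = sum(y_actual) / n
        var_total = sum((y - y_mean) ** 2 for y in y_actual)
        var_residual = sum((y - yp) ** 2 for y, yp in zip(y_actual, y_pred))
        return 1 - var_residual / var_total

    print(r2_score([3.0, 2.5, 4.0], [2.8, 2.7, 4.2]))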



For classification problems

Precision:   #(positives predicted correctly) / #(predicted positives)

Recall:        #(positives predicted correctly) / #(actual positives)
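
A minimal sketch, assuming binary 0/1 labels where 1 means positive (names are illustrative):

    def precision_recall(y_actual, y_pred):
        # y_actual and y_pred are lists of 0/1 labels
        true_pos = sum(1 for a, p in zip(y_actual, y_pred) if a == 1 and p == 1)
        pred_pos = sum(y_pred)      # number of predicted positives
        actual_pos = sum(y_actual)  # number of actual positives
        precision = true_pos / pred_pos if pred_pos else 0.0
        recall = true_pos / actual_pos if actual_pos else 0.0
        return precision, recall

    print(precision_recall([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))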

Area Under the Precision-Recall Curve (AUC-PR)

Draw the precision-recall curve and calculate the area beneath it.

There is typically a trade-off: as precision increases, recall tends to decrease. The more the curve leans towards the top-right corner, the better.
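
A rough sketch of how the curve and the area could be computed by sweeping a decision threshold over predicted scores. Trapezoidal integration is just one convention, and the segment below the smallest recall value is ignored here:

    def pr_curve_area(y_actual, scores):
        # Sweep thresholds over the distinct scores (highest first) and
        # collect one (recall, precision) point per threshold.
        thresholds = sorted(set(scores), reverse=True)
        actual_pos = sum(y_actual)
        points = []
        for t in thresholds:
            preds = [1 if s >= t else 0 for s in scores]
            tp = sum(1 for a, p in zip(y_actual, preds) if a == 1 and p == 1)
            pp = sum(preds)
            precision = tp / pp if pp else 1.0
            recall = tp / actual_pos if actual_pos else 0.0
            points.append((recall, precision))
        points.sort()  # order points by recall before integrating
        # trapezoidal area under the precision-recall curve
        area = 0.0
        for (r0, p0), (r1, p1) in zip(points, points[1:]):
            area += (r1 - r0) * (p0 + p1) / 2
        return points, area

    y_actual = [1, 0, 1, 1, 0, 0]
    scores   = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
    pts, auc_pr = pr_curve_area(y_actual, scores)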

Average Precision:  

This is essentially the same as the area under the precision-recall curve. It's calculated as the integral of precision over recall from 0 to 1.

Denote Pr(r) as the precision when recall is r, then

     Average Precision  AP = Integral( Pr(r) dr ) for r = 0 to 1

The integral gives the area under the precision-recall curve.

This is typically used for ranking problems. Order the n predicted instances by score. Denote Pr(k) as the precision of the top k instances.

The integral form of AP can be rewritten as:

    AP = 1/P * sum(Pr(k) * rel(k)) for k = 1 to n

Here P is the number of positive instances, and rel(k) = 1 if the kth instance is positive (i.e. it increases recall by 1/P), and 0 otherwise.

In this way, precision is only counted at the ranks where recall improves.
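
A minimal sketch of this discrete form, assuming the 0/1 labels are already ordered by decreasing predicted score (names are illustrative):

    def average_precision(ranked_labels):
        # ranked_labels: 0/1 labels ordered by decreasing predicted score
        n_pos = sum(ranked_labels)       # P, the number of positive instances
        if n_pos == 0:
            return 0.0
        hits = 0
        ap = 0.0
        for k, label in enumerate(ranked_labels, start=1):
            if label == 1:
                hits += 1
                ap += hits / k           # Pr(k), only counted when recall improves
        return ap / n_pos

    print(average_precision([1, 0, 1, 1, 0]))   # (1 + 2/3 + 3/4) / 3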

Average Precision @ K:

Sometimes you only care about the top K instances, e.g. the first 10 pages of the search results.

 AP@K = the same sum as AP, but truncated at rank K (only the top K instances are considered)

Mean Average Precision:

Assume there are Q queries (e.g. Q searches on Google). Denote AP(q) as the average precision for the qth query.

MAP = sum(AP(q)) / Q  for q = 1 to Q

Mean Average Precision @ K:

Similarly, consider only the top K instances of each query when calculating MAP.
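
A minimal sketch of AP@K and MAP@K; normalizing by min(P, K) is one common convention, not the only one:

    def average_precision_at_k(ranked_labels, K):
        # same sum as AP, but truncated at rank K
        n_pos = sum(ranked_labels)          # P, positives for this query
        if n_pos == 0:
            return 0.0
        hits = 0
        ap = 0.0
        for k, label in enumerate(ranked_labels[:K], start=1):
            if label == 1:
                hits += 1
                ap += hits / k              # Pr(k) at ranks where recall improves
        return ap / min(n_pos, K)           # one common normalization convention

    def mean_average_precision_at_k(ranked_labels_per_query, K):
        # MAP@K = mean of AP@K over all queries
        aps = [average_precision_at_k(labels, K) for labels in ranked_labels_per_query]
        return sum(aps) / len(aps)

    # e.g. two queries, each with labels ordered by predicted score
    print(mean_average_precision_at_k([[1, 0, 1, 0], [0, 1, 1, 1]], K=3))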

Cumulative Gain (CG)

A ranking algorithm orders a list of documents as follows:

   D1, D2, D3, D4, D5, D6

Assume the actual relevance of the documents are:

   3, 2, 3, 0, 1, 2

Then the Cumulative Gain 

   CG = sum(relevance) = 3+2+3+0+1+2 = 11

Changing the order of two items in the returned list doesn't affect CG as the sum is still the same. E.g. Swapping D3 and D4 doesn't change CG.
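
A tiny sketch showing that CG ignores order:

    def cumulative_gain(relevances):
        # CG is just the sum of relevances; it ignores position entirely
        return sum(relevances)

    relevances = [3, 2, 3, 0, 1, 2]   # D1..D6 in the returned order
    swapped    = [3, 2, 0, 3, 1, 2]   # D3 and D4 swapped
    print(cumulative_gain(relevances), cumulative_gain(swapped))   # both 11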

To make sure more relevant documents are ranked higher than less relevant documents, we include the rank of the document in the metric.

Discounted Cumulative Gain (DCG)

   DCG = sum(relevance_i / log2(i+1)) for i = 1 to n

where i is the rank of a document and log2 is the base-2 logarithm (the usual convention).

So if a relevant document is ranked too low (i.e. i is big), then its contribution relevance_i / log2(i+1) is small.

Similarly, if a less relevant document is ranked too high, then the high rank is wasted, as its contribution is still low.

This metric therefore encourages placing relevant documents at high ranks and less relevant documents at low ranks.

For a classification problem where the relevance is 0 / 1, this still works, as it simply encourages placing the 1s at high ranks and the 0s at low ranks.
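
A minimal sketch of DCG for the example above, using the base-2 log convention:

    import math

    def dcg(relevances):
        # DCG = sum(relevance_i / log2(i + 1)), with ranks i starting at 1
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

    print(dcg([3, 2, 3, 0, 1, 2]))   # ~6.86 for the example ranking D1..D6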

Normalized Discounted Cumulative Gain (NDCG)

For different searches the number of documents may be different so we need to normalize the gain to compare fairly between searches.

Firstly, get the ideal ranking

    3, 3, 2, 2, 1, 0

And calculate the ideal DCG, denoted DCG_ideal.

Secondly, calculate the actual DCG, denoted DCG_actual, given the actual ranking.

Then NDCG = DCG_actual / DCG_ideal
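
A minimal sketch of NDCG, where the ideal ranking is simply the relevances sorted in descending order:

    import math

    def dcg(relevances):
        # DCG = sum(relevance_i / log2(i + 1)), with ranks i starting at 1
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

    def ndcg(relevances):
        # normalize by the DCG of the ideal (descending-relevance) ordering
        ideal_dcg = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal_dcg if ideal_dcg else 0.0

    print(ndcg([3, 2, 3, 0, 1, 2]))   # ~0.96; ideal ordering is 3, 3, 2, 2, 1, 0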