Loss function and Eval metric

Loss function

The Mean Squared Error loss is derived from maximum likelihood under the assumption of normally distributed noise.

Assume the true model is determined by weights w.

Given an input X = (x1, x2, ..., xn) with n features, the value predicted by the true model is y'.

The observation y is not necessarily the same as y': because of noise, y deviates from y'.

Now assume the noise follows a normal distribution, so y is normally distributed with mean y'.

the probability of observing y is:

    P(y | w) = N(y | w), where N(y | w) is the normal density with mean y' (the model's prediction, a function of w and X) and some variance sigma^2.
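As a small sketch (the noise scale sigma is not specified above, so a fixed sigma is assumed here), the normal density can be evaluated directly:

```python
import math

def normal_pdf(y, mean, sigma=1.0):
    # Density of N(mean, sigma^2) at y; here mean plays the role of y',
    # the model's prediction, and sigma is an assumed fixed noise scale.
    return (math.exp(-(y - mean) ** 2 / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

# The closer the observation y is to the prediction y', the higher the density:
# normal_pdf(2.1, 2.0) > normal_pdf(3.0, 2.0)
```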

There are k training samples y1, y2, ..., yk. Assuming independent noise, the probability of observing all k samples is:

    likelihood = P(y1 | w) * P(y2 | w) * ... * P(yk | w) = N(y1 | w) * N(y2 | w) * ... * N(yk | w)


The goal is to choose weights w such that the likelihood N(y1 | w) * N(y2 | w) * ... * N(yk | w) is maximized.

This is equivalent to maximizing log(N(y1 | w) * N(y2 | w) * ... * N(yk | w)); taking the log is a standard trick that turns the product into a sum.

    log likelihood = log(N(y1 | w)) + log(N(y2 | w)) + ... + log(N(yk | w))

Substituting the normal density N(y | w) = 1/(sigma * sqrt(2*pi)) * exp(-(y - y')^2 / (2*sigma^2)) into the log likelihood and dropping terms that don't depend on w gives -sum_{i=1..k} (yi - yi')^2 / (2*sigma^2).

Maximizing -sum_{i=1..k} (yi - yi')^2 is equivalent to minimizing sum_{i=1..k} (yi - yi')^2, i.e. the sum of squared errors.

    negative log likelihood = sum_{i=1..k} (yi - yi')^2 / (2*sigma^2) + constant

Historically, people started using squared error as a loss function first, and only later realized that it corresponds to maximum likelihood. Nice.

    Loss = sum_{i=1..k} (yi - yi')^2
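A quick numerical check of this equivalence, using a hypothetical 1-D model y' = w * x with an assumed noise scale sigma (neither is specified in the text): the Gaussian negative log likelihood and the sum of squared errors differ only by a positive scale and a constant, so they rank candidate weights identically.

```python
import math
import random

random.seed(0)

# Hypothetical 1-D model: y = w_true * x + Gaussian noise with std sigma.
w_true, sigma = 2.0, 0.5
xs = [random.uniform(0, 5) for _ in range(100)]
ys = [w_true * x + random.gauss(0, sigma) for x in xs]

def gaussian_nll(w):
    # Negative log likelihood of the data under N(y | w*x, sigma^2).
    nll = 0.0
    for x, y in zip(xs, ys):
        mu = w * x
        nll += (0.5 * math.log(2 * math.pi * sigma ** 2)
                + (y - mu) ** 2 / (2 * sigma ** 2))
    return nll

def sse(w):
    # Sum of squared errors for the same candidate weight.
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

# The two criteria pick the same best weight from a candidate grid:
candidates = [1.5, 1.8, 2.0, 2.2, 2.5]
best_by_nll = min(candidates, key=gaussian_nll)
best_by_sse = min(candidates, key=sse)
assert best_by_nll == best_by_sse
```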


Notice the assumption here: the noise follows a normal distribution. If the noise doesn't follow a normal distribution, then squared error is not ideal, but it probably still works in practice.

If you assume another noise distribution, for example Gamma, then you may write the Gamma density in terms of its mean (i.e. y'), then expand and simplify its negative log likelihood. The result becomes your new loss function under the Gamma noise assumption.
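A sketch of that recipe (the fixed shape parameter and the mean parameterization are illustrative assumptions, not from the text): write the Gamma density with mean mu = y', take the negative log, and keep the mu-dependent terms as the loss.

```python
import math

def gamma_nll(y, mu, shape=2.0):
    # Negative log likelihood of one observation y > 0 under a Gamma
    # distribution parameterized by its mean mu (= y') and a fixed,
    # assumed shape parameter.
    rate = shape / mu  # mean of Gamma(shape, rate) is shape / rate
    return -(shape * math.log(rate)
             + (shape - 1) * math.log(y)
             - rate * y
             - math.lgamma(shape))

# Dropping terms independent of mu, the per-sample loss is proportional
# to log(mu) + y / mu, which is minimized at mu = y (unlike squared
# error, it penalizes under- and over-prediction asymmetrically):
best_mu = min([1.0, 2.0, 3.0, 4.0, 5.0],
              key=lambda mu: gamma_nll(3.0, mu))
```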


Eval metric

An evaluation metric is used to measure the performance of your model,

for example precision or area under the ROC curve (AUC).

Ideally the loss function is the same as the eval metric, so the model is trained directly towards the best performance.

However, some eval metrics, such as AUC, are hard to differentiate, so using AUC as the loss function wouldn't work in gradient-based training: you cannot compute its derivative / gradient.

Therefore, people use a differentiable surrogate like mean squared error to train the model, and compute the eval metric afterwards on the trained model.

In a grid search, thousands of models may be trained using mean squared error, but in the end you choose the model with, for example, the best AUC score.
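A minimal sketch of that workflow, with made-up data and a hand-rolled pairwise AUC (the two-feature setup and the ridge penalty grid are illustrative assumptions): every candidate is trained with squared error, then the winner is picked by AUC.

```python
import random

random.seed(1)

# Hypothetical binary task with two features; feature 2 is weaker.
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
labels = [1 if x1[i] + 0.3 * x2[i] + random.gauss(0, 0.5) > 0 else 0
          for i in range(n)]

def fit_ridge(alpha):
    # Closed-form ridge regression of the 0/1 label on (x1, x2),
    # trained with squared error (the differentiable surrogate).
    a11 = sum(a * a for a in x1) + alpha
    a22 = sum(b * b for b in x2) + alpha
    a12 = sum(a * b for a, b in zip(x1, x2))
    b1 = sum(a * y for a, y in zip(x1, labels))
    b2 = sum(b * y for b, y in zip(x2, labels))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def auc(scores):
    # Probability that a random positive outranks a random negative.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def model_auc(alpha):
    w1, w2 = fit_ridge(alpha)
    return auc([w1 * a + w2 * b for a, b in zip(x1, x2)])

# Grid search: every model is trained with squared error, but the
# final model is chosen by its AUC.
grid = [0.0, 1.0, 10.0, 100.0]
best_alpha = max(grid, key=model_auc)
```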