Post date: Mar 23, 2012 3:20:04 PM
Present were:
Florin (moderator)
Alexander
Kristin
Isabelle
Hugo Jair
Gavin
We discussed:
- use of AIC, BIC, and MDL criteria in cases where CV is too computationally expensive, impractical, or impossible (we still need to inventory these cases and give clear motivations; a small AIC/BIC sketch follows after this list)
- CV error bars
- CV splits
- learning curves
- reserving a final test set (still need to develop this aspect and not just for the simple classification case; we need to plug in statistical tests)
- different kinds of data sampling (with or without replacement; we have not yet discussed stratified CV)
- regularizing CV (including with performance bounds (Isabelle), with Bayesian priors (Gavin), by replacing the lower level by optimality conditions (Kristin))
- unsupervised learning model selection using CV on model likelihood and extension of the concept to CV with any loss function
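For the AIC/BIC bullet above, a minimal sketch of how these criteria can be computed without any resampling, assuming Gaussian residuals in an ordinary least-squares fit; the toy data and candidate polynomial degrees are placeholders, not a recommendation.

    # Minimal sketch: AIC/BIC as cheap alternatives to CV, assuming an ordinary
    # least-squares fit with Gaussian residuals. Data and degrees are placeholders.
    import numpy as np

    def gaussian_aic_bic(y, y_hat, n_coefs):
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)
        # Maximised Gaussian log-likelihood with the MLE sigma^2 = RSS / n
        log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
        k = n_coefs + 1                      # +1 for the estimated noise variance
        return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 50)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(50)
    for degree in (1, 3, 9):                 # compare model complexities without resampling
        coefs = np.polyfit(x, y, degree)
        aic, bic = gaussian_aic_bic(y, np.polyval(coefs, x), degree + 1)
        print(degree, round(aic, 1), round(bic, 1))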
Florin's summary:
1) The relationship between model selection in AR time series (TS) identification and in ML: in TS identification CV had been used, but it was replaced by simpler and faster approaches. In the 70s and 80s CV was rather expensive!
2) I also mentioned that in the TS literature a model selection procedure has been expected to be strongly consistent, i.e. to converge (exponentially fast) to the correct model IF the data were generated by a linear model. This is less relevant to real-world concerns (and has proven to be problematic), as we don't know the generating model (which in ML is quite the point!), but it's a nice property to have.
3) We then discussed what we really want, which is low generalization error (i.e. to learn). I argued that we don't care about the variance of CV estimates per se, but rather about an upper bound on the expected loss (a 1-beta bound), and that without knowing the variance we cannot guarantee such bounds except in certain special cases; since we want a general toolbox, we need some estimate of that variance. The point of 10-fold CV is that we can average the error across folds, take its standard deviation (or standard error), and choose our model based on this. Again, Bengio's paper argues, if I may say it in 6 words, abusing perhaps a poetic license, "stick with the devil you know." .... but why 10 and not 12.... why 70% and not 72%...?
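As an illustration of point 3, a minimal sketch of the "average and take the standard error over the folds" procedure; the dataset and the two candidate models are arbitrary placeholders.

    # Sketch of point 3: per-fold error rates, their mean, and their standard
    # error, used to compare candidate models. Dataset and models are placeholders.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    candidates = {"logreg": LogisticRegression(max_iter=5000),
                  "svm_rbf": SVC(C=1.0, gamma="scale")}
    for name, model in candidates.items():
        fold_err = 1.0 - cross_val_score(model, X, y, cv=cv)   # held-out error per fold
        sem = fold_err.std(ddof=1) / np.sqrt(len(fold_err))    # standard error of the mean
        print(f"{name}: {fold_err.mean():.3f} +/- {sem:.3f}")
    # A common rule of thumb: prefer the simplest model whose mean error lies
    # within one standard error of the best ("one-standard-error rule").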
4) Some of you said that 10-fold CV does indeed work, and others said that perhaps we don't need bounds anyway. A lot of this is guided by experience: did a model search on some initial data work out in the long term? Did another fail?
5) I then suggested we use other split types, keeping test sets independent, so that we can infer both an 'expected' learning curve and its variability (by regression), to see whether that variability is problematic, or whether the curve 'dominates' other machines' learning curves and hence problem solved, TGIF!
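A rough sketch of point 5, with placeholder data, learner, and trend model; note that sklearn's learning_curve reuses the same data across sizes rather than keeping the test sets fully independent, so it only approximates the idea.

    # Rough sketch of point 5: estimate a learning curve and its variability,
    # then fit a simple trend (a power law on log-log axes) by regression.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import learning_curve
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    sizes, _, test_scores = learning_curve(
        SVC(gamma=0.001), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
    err_mean = 1.0 - test_scores.mean(axis=1)
    err_std = test_scores.std(axis=1)

    # Regress log(error) on log(size) to get an approximate power-law trend.
    slope, intercept = np.polyfit(np.log(sizes), np.log(err_mean), 1)
    print("fitted power-law exponent:", round(-slope, 2))
    for n, m, s in zip(sizes, err_mean, err_std):
        print(int(n), round(m, 3), "+/-", round(s, 3))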
6) There was a discussion of the other great big habit of ML, the single 70%-90% train/test split, and of the fact that we don't really know where it comes from. It may only be a gut feeling, but if we all share it there must be some logic to it, flawed or not.
7) I tried to summarize the problem at hand by stating that if CV (or any type of train/test split) is to be the guiding principle, and from both a sociological and an ontological point of view it IS, then the split-type dilemma is mostly a problem of dataset size. Very large sets can be split into multiple independent train/test sets, while for smaller ones 10-fold CV is unreliable... so what is to be done?
8) I also suggested that conservative estimates of the required validation set size are quite stringent and increase with the number of models compared.
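One standard way to formalise point 8 is a Hoeffding plus union bound (not necessarily the calculation meant above): the required validation size grows logarithmically with the number of models but quadratically with the desired precision.

    # Hoeffding + union bound: to have every one of M models' validation errors
    # within eps of its true error with probability 1 - delta,
    #     n >= ln(2 * M / delta) / (2 * eps**2)
    # validation examples suffice. eps and delta below are arbitrary placeholders.
    import math

    def required_validation_size(n_models, eps=0.02, delta=0.05):
        return math.ceil(math.log(2 * n_models / delta) / (2 * eps ** 2))

    for m in (1, 10, 100, 10_000):
        print(m, required_validation_size(m))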
9) I finally suggested a conceptual approach that may unify model selection in regression and classification, which are ultimately the same problem (probability density estimation): the average over splits should be expressed in terms of posterior likelihood rather than the loss itself, and nested methods imply some kind of prior.
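A minimal sketch of point 9: score the folds by out-of-fold predictive log-likelihood instead of a 0/1 loss. The two models and the dataset are placeholders.

    # Sketch of point 9: compare models by their average out-of-fold predictive
    # log-likelihood rather than by a 0/1 loss. Models and data are placeholders.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_breast_cancer(return_X_y=True)
    for name, model in [("naive_bayes", GaussianNB()),
                        ("logreg", LogisticRegression(max_iter=5000))]:
        loglik = cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")
        acc = cross_val_score(model, X, y, cv=10, scoring="accuracy")
        print(f"{name}: mean log-likelihood {loglik.mean():.3f}, "
              f"accuracy {acc.mean():.3f}")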
Open questions:
- Why don't we try to reach some consensus on 'best practice' given problem type and size, and then see what justifications for this consensus can be reached? A tall order, but it can be attempted. If we can't fully agree on that, I think we can still all agree on which model selection/validation criteria NOT to use; we all see plenty of examples of those even at present.
- Alexander, you said you remembered a paper outlining a justification for the magical 70% to 90%; I think we'd all like to see it. I for one am very much open to constructive arguments.
- One of the most common complaints I hear from students after a challenge is 'I was statistically tied with the winner' (luckily we know nobody can conclusively prove it). Still, we use average error over a holdout set as the DE FACTO gold standard of model selection, yet we cannot recommend it to practitioners at large. The reason is a behavioral, psychological, normative constraint, namely that we don't trust humans NOT to touch the validation set (i.e. careful, overfitting!), and not an argument rooted in statistics and mathematics, even granting liberal use of strong assumptions.
- Gavin, you mentioned (unless I stand corrected) that we don't necessarily have to pay attention to the variance of the test error. Why is that so?
Gavin's answer:
(i) Regarding the choice of split sizes, or of k in k-fold cross-validation, etc.: essentially the optimal split size is a compromise between the variance of the model-fitting procedure and the variance of the performance estimate, and this will vary from dataset to dataset. If we make the training set large, the stability of training will increase, but we will have little test data, so our performance estimate (even if unbiased) will be unreliable because it has a high variance. If we have a large test set then the performance estimate will be good, but it will be biased (in the sense of being a performance estimate for a model trained on a small training set) and the variance of the model fitting will be high, so there will still be great variation between replications with different random partitions of the data. So perhaps what we ought to do is simply choose the split that minimises the sample variance of the performance estimate over a large number of replications.
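A sketch of the scan described in (i): repeat many random splits at each candidate test fraction and inspect the spread of the resulting estimates. The learner, dataset and fractions are placeholder choices.

    # Sketch of point (i): for each candidate test fraction, repeat many random
    # splits and look at the spread of the accuracy estimate; one could then
    # pick the fraction with the smallest sample variance.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import ShuffleSplit, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_breast_cancer(return_X_y=True)
    for test_size in (0.1, 0.3, 0.5):
        cv = ShuffleSplit(n_splits=100, test_size=test_size, random_state=0)
        scores = cross_val_score(GaussianNB(), X, y, cv=cv)
        print(f"test fraction {test_size}: mean {scores.mean():.3f}, "
              f"variance {scores.var(ddof=1):.5f}")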
(ii) The other point is that once you have performed all of these replications, you can use the sample mean of the performance estimate. This is a variance reduction method, in which case you could get away with more training data and less test data. Perhaps this is a justification for 10-fold cross-validation, where the variance reduction from averaging over the folds makes up for the small size of the test set in each fold, giving a better compromise than any single split?
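A companion sketch of (ii): compare the spread of a single 90/10 split estimate with the spread of the 10-fold average, over repeated random partitions. Again the learner and dataset are placeholders.

    # Sketch of point (ii): the 10-fold average as a variance-reduction step,
    # compared with a single 90/10 split, over repeated random partitions.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_breast_cancer(return_X_y=True)
    single = cross_val_score(GaussianNB(), X, y,
                             cv=ShuffleSplit(n_splits=50, test_size=0.1,
                                             random_state=0))
    averaged = [cross_val_score(GaussianNB(), X, y,
                                cv=KFold(n_splits=10, shuffle=True,
                                         random_state=rep)).mean()
                for rep in range(50)]
    print("single 90/10 split variance:", round(np.var(single, ddof=1), 6))
    print("10-fold average    variance:", round(np.var(averaged, ddof=1), 6))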
(iii) I think the point I was trying to make was not that we can ignore the variance of the performance estimate, but rather to question whether the lack of an unbiased estimator of this variance is really a practical problem. This is because the estimator of the variance itself has a variance, and if the sample size is small (i.e. exactly when you really need cross-validation) the variance will dominate the bias. It is a bit like the approximate unbiasedness of leave-one-out cross-validation: it is theoretically slightly reassuring, but in practice it is the variance of the leave-one-out estimator that is of practical concern. I am also not sure that unbiasedness is that helpful anyway, as we are not interested in the expectation of the estimator over a large population of datasets; we would want to know the error bars on the prediction error for the particular sample of data we actually have, which suggests Bayesian approaches might be more appropriate?
(iv) In practice I generally use leave-one-out cross-validation, simply because it is very cheap for the models I use, which means I can put more computational expense into a proper nested performance evaluation. For the performance-estimation outer loop I generally use repeated split-sample estimates (usually 100 repetitions, or more if I can afford it), or the bootstrap (Yves Grandvalet's comments at the NIPS 2006 workshops convinced me this is a good idea). For the final predictions I use an ensemble of all the models trained during the performance evaluation (which hopefully means the performance estimate is slightly pessimistically biased).
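A rough sketch of the protocol in (iv), with placeholder models, data, grid, and number of repetitions (the text suggests 100+): leave-one-out selection inside each training set, repeated split-sample estimation outside, and an ensemble of the selected models for the final predictions.

    # Rough sketch of the protocol in (iv): inner leave-one-out model selection,
    # an outer loop of repeated split-sample estimates, and an ensemble of all
    # the selected models for the final predictions.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, LeaveOneOut, ShuffleSplit
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    outer = ShuffleSplit(n_splits=20, test_size=0.1, random_state=0)
    grid = {"n_neighbors": [1, 3, 5, 9]}

    outer_scores, models = [], []
    for train_idx, test_idx in outer.split(X, y):
        search = GridSearchCV(KNeighborsClassifier(), grid, cv=LeaveOneOut())
        search.fit(X[train_idx], y[train_idx])        # inner LOO model selection
        outer_scores.append(search.score(X[test_idx], y[test_idx]))
        models.append(search.best_estimator_)
    print("outer accuracy:", round(np.mean(outer_scores), 3),
          "+/-", round(np.std(outer_scores, ddof=1), 3))

    def ensemble_predict(fitted_models, X_new):
        # Final predictions: majority vote over the models from the outer loop.
        votes = np.stack([m.predict(X_new) for m in fitted_models])
        return (votes.mean(axis=0) > 0.5).astype(int)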
===
Comments of Isabelle and answers of Gavin:
As discussed, we do not need to impose our views on the world. Rather, we can provide an enabling tool that will allow people to easily implement what they want, with some "meaningful" default options. In that respect, it is very useful that we share with one another what we do in practice in a variety of situations.
I certainly agree with this; the more I look into model selection, the more I find that what we generally do is not very reliable, and I'd love to find out that there are options I haven't heard about that work better than what I already do!
This line of reasoning is valid if what we want to do is estimate the performance of one model. But if we use this performance to choose among several models, we have a multiple-testing problem that degrades the error bar. If instead of selecting among a discrete number of models we select among hyperparameters, we need to evaluate the "complexity" of that level of inference to estimate the error bar. Hence, in the end, the training/validation fraction depends on the complexity of the first level of inference vs. that of the second level of inference.
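A toy simulation of this multiple-testing effect, with arbitrary placeholder numbers: M candidate models all share the same true error rate, yet the validation error of the selected (apparently best) one is optimistically biased, increasingly so as M grows.

    # Toy simulation: M models with identical true error rate p; selecting the
    # lowest validation error yields an optimistic estimate whose bias grows
    # with M. All numbers are arbitrary placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    p_true, n_val, n_trials = 0.20, 100, 2000
    for n_models in (1, 10, 100):
        # Validation error of each model: fraction of n_val Bernoulli(p) mistakes.
        errs = rng.binomial(n_val, p_true, size=(n_trials, n_models)) / n_val
        best = errs.min(axis=1)            # error reported for the selected model
        print(f"M={n_models:3d}: mean reported error {best.mean():.3f} "
              f"(true error {p_true})")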
Yes, I agree. For competitions I have normally made my final model choice based on the final performance estimate (which is a bit naughty), moderated a little by "gut feeling" (where I thought the complexity of the number of choices I had made probably meant that choosing something simpler might be a good idea). However, I think the idea is still valid if nested within the final performance-estimation step (so the basic idea is used twice, but that would be computationally very expensive).
I agree. We don't know by how much we reduce the variance by averaging over multiple splits, but we know that we reduce it.
yes, I'm not sure that in practice on "small" datasets we can ever do anything usefully better than that :-(
My questions: is it worth the trouble of regularizing the second level of inference? What is the best method?
I think it definitely is a good idea. To an extent the problems at every level of inference are the same, namely the bias-variance trade-off. I initially thought that the issue becomes less important the higher in the hierarchy you go, but my experimental results suggest that this is not the case, and that all regularisation at the first level really achieves is to shift the problem to the second level of inference!
The method I used is a start, but it doesn't really solve the problem. If we can find a method where ARD kernels are rarely worse than an RBF kernel, I think that would be a real test of the solution. Maybe that can be made the focus of a competition somehow?
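One very simple way to regularise the hyperparameter (second) level, offered only as an illustration of the idea and not as the method referred to above: penalise the CV criterion with a quadratic term pulling the log hyperparameter towards a default value, so that large excursions need strong support from the data. The dataset, penalty strength, and search range below are placeholders.

    # Illustration only: regularise the second level of inference by adding a
    # quadratic penalty on the log hyperparameter, centred on a default value,
    # to the cross-validation error.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    lam = 0.02                                    # strength of the hyper-prior
    log_gamma0 = np.log(1.0 / X.shape[1])         # default: gamma = 1 / n_features

    best = None
    for log_gamma in np.linspace(log_gamma0 - 6, log_gamma0 + 6, 25):
        model = make_pipeline(StandardScaler(), SVC(gamma=np.exp(log_gamma)))
        cv_err = 1.0 - cross_val_score(model, X, y, cv=5).mean()
        penalised = cv_err + lam * (log_gamma - log_gamma0) ** 2
        if best is None or penalised < best[0]:
            best = (penalised, log_gamma, cv_err)
    print("chosen log gamma:", round(best[1], 2), "cv error:", round(best[2], 3))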
What fraction training/validation do you use in your splits?
IIRC it is usually 90%/10%, but there wasn't much theory behind it. I suspect it is because I started off with 10-fold cross-validation and then, rather than repeated cross-validation, I used repeated split-sample because the additional randomisation seemed like a good idea.
best regards
Gavin
BTW this paper might be of interest; it describes how to perform complete cross-validation for k-NN (i.e. using all possible training/test splits, IIRC): http://www.cs.cmu.edu/~rahuls/pub/icml2000-rahuls.pdf . I haven't got round to implementing it, but it looks like a neat solution, for k-NN at least.
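In case it helps, here is a naive brute-force version of that idea (a literal enumeration of all splits, NOT the efficient algorithm from the paper), only feasible for very small datasets; the data subset, test size, and k are placeholders.

    # Naive brute-force "complete" cross-validation for k-NN: evaluate every
    # possible train/test split of a given test size.
    from itertools import combinations

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)
    keep = rng.choice(len(y), size=20, replace=False)   # tiny subset: C(20, 3) = 1140 splits
    X, y = X[keep], y[keep]
    all_idx, test_size = np.arange(len(y)), 3

    scores = []
    for test_idx in combinations(all_idx, test_size):
        test_idx = np.array(test_idx)
        train_idx = np.setdiff1d(all_idx, test_idx)
        clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    print("complete-CV accuracy:", round(float(np.mean(scores)), 3),
          "over", len(scores), "splits")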