If test sets provide unstable results because of sampling, the solution is to systematically sample a certain number of test sets and then average the results. It's a statistical approach (to observe many results and take an average of them), and that's the basis of cross-validation. The recipe is straightforward:
1. Divide your data into folds (usually 10).
2. Hold out one fold as a test set and use the others as the training set.
3. Train the model and record the test-set result. If you have little data, it's better to use a larger number of folds, because each training set then contains more of the available data, which improves the quality of training.
4. Repeat Steps 2 and 3, using each fold in turn as the test set.
5. Calculate the average and the standard deviation of all the folds' test results. The average is a reliable estimate of the quality of your predictor; the standard deviation tells you how dependable that estimate is. Expect predictors with high variance to show a high cross-validation standard deviation.
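The recipe above can be written out by hand in a few lines. The following is a minimal sketch, not part of the original text: it uses a synthetic regression problem generated with make_regression purely for illustration, so substitute your own X and y.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic regression data purely for illustration; use your own X and y.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

n_folds = 10
# Step 1: shuffle the row indices and split them into folds
indices = np.random.RandomState(42).permutation(len(X))
folds = np.array_split(indices, n_folds)

scores = []
for k in range(n_folds):
    test_idx = folds[k]                                  # Step 2: hold out one fold as a test set
    train_idx = np.concatenate(folds[:k] + folds[k + 1:])
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # Step 3: train on the other folds
    scores.append(model.score(X[test_idx], y[test_idx]))         # ...and record the test-fold R^2

# Step 5: the mean estimates predictor quality, the standard deviation its reliability
print('Mean R^2: %.3f, std: %.3f' % (np.mean(scores), np.std(scores)))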
Even though this technique may appear complicated, Scikit-learn handles it for you with the functions in the sklearn.model_selection module.
To run cross-validation, you first have to initialize an iterator. KFold is the iterator that implements k-fold cross-validation. Other iterators are available from the sklearn.model_selection module, mostly derived from statistical practice, but KFold is the most widely used in data science practice.
KFold requires you to specify the n_splits parameter (the number of folds to generate) and to indicate whether you want to shuffle the data (by using the shuffle parameter). As a rule, the higher the expected variance, the more increasing the number of splits improves the mean estimate. It's a good idea to shuffle the data because ordered data can mislead the learning process for some algorithms if the first observations differ from the last ones.
After setting KFold, call the cross_val_score function, which returns an array of results containing a score (from the scoring function) for each cross-validation fold. You have to provide cross_val_score with your data (both X and y) as an input, your estimator (the regression class), and the previously instantiated KFold iterator (as the cv parameter). In a matter of a few seconds or minutes, depending on the number of folds and data processed, the function returns the results.
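The two steps together look like the following minimal sketch. It is an assumption-laden illustration rather than the text's own listing: the data again comes from make_regression, and LinearRegression stands in for whatever estimator you are evaluating.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data purely for illustration; substitute your own X and y.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

regression = LinearRegression()                                      # the estimator
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)   # the KFold iterator

# cross_val_score returns one score per fold (R^2, the default scoring for regressors)
scores = cross_val_score(regression, X, y, cv=crossvalidation)
print('Folds: %i, mean R^2: %.3f, std: %.3f'
      % (len(scores), np.mean(scores), np.std(scores)))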
Cross-validation folds are decided by random sampling. Sometimes it may be necessary to track whether, and how much of, a certain characteristic is present in the training and test folds in order to avoid malformed samples. For instance, the Boston dataset has a binary variable (a feature that has a value of 1 or 0) indicating whether the house bounds the Charles River. This information is important for understanding the value of a house and determining whether people are willing to spend more for it. You can see the effect of this variable using the following code.
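One possible way to draw that plot is sketched below. It assumes the Boston housing data has already been loaded into a pandas DataFrame named boston with the CHAS dummy variable and a MEDV target column; how you load it depends on your scikit-learn version (load_boston was removed in recent releases).

import pandas as pd
import matplotlib.pyplot as plt

# Assumed: `boston` is a DataFrame with a binary CHAS column (bounds the Charles
# River or not) and a MEDV column (median house value).
boston.boxplot(column='MEDV', by='CHAS')
plt.show()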
A boxplot reveals that houses on the river tend to have higher values than other houses. In similar situations, when a characteristic is rare or influential, you can't be sure how much of it ends up in each sample because the folds are created randomly. Having too many or too few examples of a particular characteristic in each fold implies that the machine learning algorithm may derive incorrect rules.
The StratifiedKFold class provides a simple way to control the risk of building malformed samples during cross-validation procedures. It can control the sampling so that certain features, or even certain outcomes (when the target classes are extremely unbalanced), are always present in your folds in the right proportion. You just need to point out the variable you want to control, passed as the y argument of the split method, as shown in the following code.
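The following sketch illustrates the idea under explicit assumptions: X, y, and the binary river indicator (called chas here) are already available as NumPy arrays of matching length. Because cross_val_score accepts an iterable of (train, test) index splits as its cv argument, you can feed it the splits that StratifiedKFold produces on the variable you want to keep balanced.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumed: X (features), y (target), and chas (binary river indicator) already exist.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Stratify on chas so every fold contains it in the right proportion; the generator
# of (train, test) index splits is passed directly as the cv argument.
scores = cross_val_score(LinearRegression(), X, y, cv=skf.split(X, chas))
print('Mean R^2: %.3f, std: %.3f' % (np.mean(scores), np.std(scores)))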
SVD on Homes Database
Using homes.csv, try to do the following:
Set the matrix A to be all the columns in homes. (You can use .values to make it a NumPy array.) Then print it.
Perform SVD on matrix A. Then print out the matrices U, s, and Vh.
Delete the last 3 columns of matrix U and adjust s and Vh accordingly. Then multiply them back together and compare the result with the original homes table.
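One possible sketch of these steps follows. It assumes homes.csv sits in the working directory and contains only numeric columns; the file name and column contents are taken from the exercise statement, not verified here.

import numpy as np
import pandas as pd

# Assumed: homes.csv is in the working directory and every column is numeric.
homes = pd.read_csv('homes.csv')

# Matrix A: all the columns of homes as a NumPy array
A = homes.values
print(A)

# Reduced SVD: U is (m, n), s holds the n singular values, Vh is (n, n)
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U, s, Vh, sep='\n')

# Drop the last 3 columns of U, and trim s and Vh to match
k = U.shape[1] - 3
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]

# Compare the rank-k reconstruction with the original homes table
print(np.abs(A - A_approx).max())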