Cross Validation and Grid Search

Usually a dataset is split into a training set and a test set.

A model is trained on the training set, and its performance is then confirmed on the test set.
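
As a minimal sketch of this basic workflow, assuming scikit-learn and an already loaded feature matrix X and target vector y (hypothetical names here):

from sklearn import svm
from sklearn.model_selection import train_test_split

#hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = svm.SVC()
model.fit(X_train, y_train)         #train on the training set only
print(model.score(X_test, y_test))  #confirm performance (mean accuracy) on the test set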

However, if the model has hyperparameters, such as C and gamma in SVC, or max_depth in tree-based models,

one often tunes the hyperparameters by checking how the model performs on the test set, and the parameters that achieve the best performance are normally adopted.

In this way, the hyperparameters are effectively derived from the test set as well, i.e. test set information 'leaks' into the training phase,

which potentially causes overfitting.

One solution is to split the dataset into three sets: training, validation and test.

The model is trained on the training set and then validated with the validation set; the best hyperparameters are selected through grid search or other methods.

When the model looks good, it is finally tested with the test set.
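
A sketch of the three-way split, again assuming scikit-learn's train_test_split and data already loaded as X and y:

from sklearn.model_selection import train_test_split

#first hold out the final test set (20%), then carve a validation set out of the remainder (25% of 80% = 20%)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

#train on (X_train, y_train), compare hyperparameters on (X_val, y_val), report once on (X_test, y_test)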

This solution has an issue. The dataset is usually not big enough (it never is), so carving out a separate validation set further

reduces the amount of data available for fitting the model. Cross validation addresses this.

Cross validation first holds out a final test set.

The rest of the data is split into k folds (k equal sets). In each of the k iterations, k-1 folds are used for training and the remaining

fold is used for validation. After the iterations, the performance scores (e.g. AUC, precision, etc.) from all the iterations are averaged to give the

final performance of the model (for choosing the best hyperparameters).

Once the hyperparameters are chosen, simply re-train the model on all k folds of data to get the model parameters (e.g. the feature weights).

The benefit of cross validation is that no extra validation set is required for choosing hyperparameters.

After the hyperparameters are chosen and the model is retrained on all k folds of data, the model performance still needs to be confirmed with the held-out test set.
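
A sketch of this procedure using scikit-learn's cross_val_score with k=5; the candidate C values and the names X, y are assumptions for illustration, and the target is assumed to be binary so that ROC AUC applies:

from sklearn import svm
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score

#hold out the final test set first
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_score, best_C = -1, None
for C in [0.5, 1, 2]:  #candidate hyperparameter values
    scores = cross_val_score(svm.SVC(C=C), X_rest, y_rest, cv=5, scoring='roc_auc')  #one AUC per fold
    if scores.mean() > best_score:  #average the k validation scores
        best_score, best_C = scores.mean(), C

final_model = svm.SVC(C=best_C).fit(X_rest, y_rest)  #retrain on all k folds with the chosen hyperparameter
print(roc_auc_score(y_test, final_model.decision_function(X_test)))  #confirm on the held-out test set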

-------------------------

GridSearch Cross Validation

scikit-learn's GridSearchCV provides a handy way to enumerate a set of parameter combinations and record the results, so it is easier to find

the best parameters for a model. The following uses SVC as an example:

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV  #GridSearchCV now lives in sklearn.model_selection; sklearn.grid_search was removed

parameters = {'gamma':np.arange(0.005, 0.01, 0.001), 'C':[0.5,1,2]} #search through the gamma and C parameter space

svc = svm.SVC()  #create an estimator object

clf = GridSearchCV(svc, parameters, scoring='roc_auc') #specify the scoring function, e.g. 'roc_auc', 'precision', 'f1', 'accuracy'; see the sklearn docs

clf.fit(X, y) #train the model with features X and target y. Here it tries all parameter combinations

#the cross validation results and the best score/parameters
clf.cv_results_      #per-combination scores (replaces the removed grid_scores_)
clf.best_score_
clf.best_estimator_
clf.best_params_

#clf.predict uses the best (refitted) model for predicting
clf.predict(Z)
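
Note that GridSearchCV runs k-fold cross validation internally on the data passed to fit (the cv argument controls the folds), and with the default refit=True it retrains the best estimator on all of that data, which is why clf.predict works directly. X and y should therefore be only the training portion of the data; the final performance still needs to be confirmed on a separate held-out test set, e.g. clf.score(X_test, y_test) with hypothetical X_test, y_test.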