scikit-learn: Model selection: choosing estimators and their parameters

Grid-search

scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during the construction and exposes an estimator API.

Cross Validation With Parameter Tuning Using Grid Search

In machine learning, two tasks are commonly done at the same time in data pipelines: cross validation and (hyper)parameter tuning. Cross validation is the process of training learners using one set of data and testing them using a different set. Parameter tuning is the process of selecting the values for a model's parameters that maximize the accuracy of the model. In this tutorial we work through an example that combines cross validation and parameter tuning using scikit-learn.

Note: This tutorial is based on examples given in the scikit-learn documentation. I have combined a few examples in the documentation, simplified the code, and added extensive explanations/code comments.

Preliminaries

    import numpy as np
    # GridSearchCV lives in sklearn.model_selection; the older
    # sklearn.grid_search module was removed in scikit-learn 0.20
    from sklearn.model_selection import GridSearchCV
    from sklearn import datasets, svm
    import matplotlib.pyplot as plt

Create Two Datasets

In the code below, we load the digits dataset and view the features of its first observation:

    # Load the digit data
    digits = datasets.load_digits()

    # View the features of the first observation
    digits.data[0:1]

    array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,
              0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.,
              0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.,
              0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,
              0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.,
              0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.,
              0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.,
              0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

The target data is a vector containing the image's true digit. For example, the first observation is a handwritten digit for '0'.
    # View the target of the first observation
    digits.target[0:1]

    array([0])

To demonstrate cross validation and parameter tuning, we first divide the digit data into two datasets called data1 and data2:

    # Create dataset 1
    data1_features = digits.data[:1000]
    data1_target = digits.target[:1000]

    # Create dataset 2
    data2_features = digits.data[1000:]
    data2_target = digits.target[1000:]

Create Parameter Candidates

Before looking for which combination of parameter values produces the most accurate model, we must specify the different candidate values we want to try. In the code below we have a number of candidate parameter values, including four different values for C:

    parameter_candidates = [
        {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
    ]

Conduct Grid Search To Find Parameters Producing Highest Score

Now we are ready to conduct the grid search using scikit-learn's GridSearchCV:

    # Create a classifier object with the classifier and parameter candidates
    clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=1)

    # Train the classifier on data1's feature and target data
    clf.fit(data1_features, data1_target)

    GridSearchCV(cv=None, error_score='raise',
        estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
            decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
            max_iter=-1, probability=False, random_state=None, shrinking=True,
            tol=0.001, verbose=False),
        fit_params={}, iid=True, n_jobs=1,
        param_grid=[{'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['rbf'], 'gamma': [0.001, 0.0001], 'C': [1, 10, 100, 1000]}],
        pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

Success! We have our results! First, let's look at the accuracy score for data1, which is the best mean cross-validated score found by the search:

    # View the accuracy score
    print('Best score for data1:', clf.best_score_)

    Best score for data1: 0.942

Which parameters are the best?
We can tell scikit-learn to display them:

    # View the best parameters for the model found using grid search
    print('Best C:', clf.best_estimator_.C)
    print('Best Kernel:', clf.best_estimator_.kernel)
    print('Best Gamma:', clf.best_estimator_.gamma)

    Best C: 10
    Best Kernel: rbf
    Best Gamma: 0.001

This tells us that the most accurate model uses C=10, the rbf kernel, and gamma=0.001.

Sanity Check Using Second Dataset

Remember the second dataset we created? Now we will use it to confirm that those parameters are actually used by the model. First, we apply the classifier we just trained to the second dataset. Then we train a new support vector classifier from scratch using the parameters found by the grid search. We should get the same results for both models.

    # Apply the classifier trained using data1 to data2, and view the accuracy score
    clf.score(data2_features, data2_target)

    0.96988707653701378

    # Train a new classifier using the best parameters found by the grid search
    svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(data1_features, data1_target).score(data2_features, data2_target)

    0.96988707653701378

Success!

K-Fold Cross Validation and GridSearchCV in Scikit-Learn

Python is one of the most popular open-source languages for data analysis (along with R), and for good reason. With well-supported open-source libraries such as NumPy and SciPy, Python is powerful enough for mining large and complex datasets, and yet versatile enough as a general-purpose programming language to integrate smoothly with web applications, databases, and other systems. Today, we’ll be taking a quick look at the basics of K-Fold Cross Validation and GridSearchCV in the popular machine learning library Scikit-Learn. Although this won’t be comprehensive, we will dig into a few of the nuances of using these. In using these two tools, we are seeking to address two main problems in data analysis.
Let’s start by exploring K-Fold Cross Validation, which is slightly simpler than GridSearchCV. We’ll call it KFCV for short. First, load up the canonical UCI digits dataset conveniently built into Scikit-Learn. Here, we’ll just use the first 1000 samples (out of 1797 total). Note that our data has 64 features, corresponding to an 8×8 grid of pixels which represent the image. The labels are 0 through 9.
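The loading step described above can be sketched as follows. This is a minimal sketch rather than the original post's code: the names x and y follow the text, while the slicing is an assumption about how the first 1000 samples were selected.

```python
from sklearn import datasets

# Load the digits dataset bundled with scikit-learn
digits = datasets.load_digits()

# Keep only the first 1000 of the 1797 samples (assumed slicing)
x = digits.data[:1000]    # pixel features, one row per image
y = digits.target[:1000]  # digit labels, 0 through 9

print(x.shape, y.shape)
```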
Printing the shapes of these two matrices x and y yields (1000, 64) and (1000,), respectively.

K-Fold Cross Validation is used to validate your model by generating different combinations of the data you already have. For example, if you have 100 samples, you can train your model on the first 90 and test on the last 10. Then you could train on samples 1-80 and 91-100, and test on samples 81-90. Then repeat. This way, you get different combinations of train/test data, essentially giving you ‘more’ data for validation from your original data. The number of times you ‘switch around’ the train/test data is the number of folds. Therefore, 3-Fold Cross Validation will yield 3 sets of train/test data, 5-Fold Cross Validation will yield 5 sets, and so forth. Here’s how we set it up:
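One possible setup is sketched below, assuming the current scikit-learn API (KFold from sklearn.model_selection with the n_splits argument). The original post may have used an older interface, and the choice of 10 folds is taken from the discussion that follows.

```python
from sklearn import datasets
from sklearn.model_selection import KFold

digits = datasets.load_digits()
x, y = digits.data[:1000], digits.target[:1000]

# 10 folds: each fold holds out a different tenth of the data for testing
kf = KFold(n_splits=10)

# kf.split yields (train_indices, test_indices) pairs, one per fold
folds = list(kf.split(x))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Without shuffling, the folds are consecutive blocks of the data, which mirrors the "first 90 / last 10" description above.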
A few notes about using the above method for KFCV:
Running the above code yields ten sets of train/test data (adding the ellipsis for brevity):
It’s hard to tell here, but if you print out the above train/test data fully you’ll see that each training set has more elements than each corresponding test set. Now, if we create a Scikit-Learn model as usual, we can use the returned train/test indices to see how well our model performs against KFCV’s 10 generated datasets:
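A sketch of that scoring loop is given below. The estimator is an assumption: a linear-kernel SVC with C=1 is used purely for illustration, since the model from the original post is not shown here.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

digits = datasets.load_digits()
x, y = digits.data[:1000], digits.target[:1000]

kf = KFold(n_splits=10)

# Assumed estimator, chosen for illustration only
svc = svm.SVC(C=1, kernel="linear")

# Fit on each training split and score on the matching test split
scores = np.array([
    svc.fit(x[train], y[train]).score(x[test], y[test])
    for train, test in kf.split(x)
])
print(scores)
```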
This is well-documented in the official tutorial page on estimator validation with KFCV. Running the above code gives a NumPy array of 10 floats, i.e. successful prediction scores, for each of our 10 datasets:
Alternatively, you could run the process using Scikit-Learn’s pre-implemented tool for scoring and validating a model:
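The name of the tool is missing from the text above. Assuming it refers to cross_val_score from sklearn.model_selection, a minimal sketch looks like this (the linear SVC is again an illustrative assumption):

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

digits = datasets.load_digits()
x, y = digits.data[:1000], digits.target[:1000]

svc = svm.SVC(C=1, kernel="linear")  # assumed estimator
kf = KFold(n_splits=10)

# cross_val_score clones the estimator, fits it on each training split,
# and returns one score per fold
cv_scores = cross_val_score(svc, x, y, cv=kf)
print(cv_scores)
```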
This gives the same results as above, but (at least in an IPython notebook) the floats seem to be truncated to 2 decimal places.

Moving on to the second problem mentioned at the beginning of this post, we’ll now check out GridSearchCV. This allows us to create a special model that will find its optimal parameter values. For example, one of the parameters we can tune is C.

It’s relatively easy to get started with GridSearchCV. Let’s check out some of the example code (slightly modified) from the official tutorial:
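The example code itself is not reproduced here, so the following is a hedged reconstruction rather than the original. The two marked statements are meant to play the role of the "first line" and "second line" discussed next; the linear SVC base estimator and the final fit call are assumptions added so the sketch runs on its own.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
x, y = digits.data[:1000], digits.target[:1000]

# "First line": 10 candidate values for C, from 10^0 to 10^4
Cs = np.logspace(0, 4, 10)

# "Second line": wrap an estimator (assumed: linear SVC) in a grid search over C
clf = GridSearchCV(estimator=svm.SVC(kernel="linear"), param_grid=dict(C=Cs))

# Fitting runs an internal cross-validation for every candidate C
clf.fit(x, y)
print(clf.best_score_, clf.best_params_)
```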
The first line sets up a possible range of values for the optimal parameter C. The function numpy.logspace, in this line, returns 10 values evenly spaced on a log scale between the exponents 0 and 4 (inclusive), i.e. our candidate values for C range from 10^0 to 10^4. It’s unlikely that the optimal C will be on the order of 10^4, of course, but that’s another story. The second line builds our classifier. Here’s a rundown of each argument, as described in the docs:
Now, we can run cross-validation techniques on this new GridSearchCV estimator as before:
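A sketch of what this might look like: the grid search is refit on each training split, then scored on the held-out split. The base estimator and the inner cv=3 setting are assumptions made to keep the run short; the original code may differ.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, KFold

digits = datasets.load_digits()
x, y = digits.data[:1000], digits.target[:1000]

Cs = np.logspace(0, 4, 10)
# cv=3 for the inner search is an assumption to limit run time
clf = GridSearchCV(estimator=svm.SVC(kernel="linear"), param_grid=dict(C=Cs), cv=3)

results = []
for train, test in KFold(n_splits=10).split(x):
    clf.fit(x[train], y[train])             # inner grid search on the training split
    test_score = clf.score(x[test], y[test])  # score on the held-out split
    results.append((test_score, clf.best_score_, clf.best_estimator_.C))

for test_score, best_score, best_C in results:
    print(test_score, best_score, best_C)
```

Each row reports the held-out score for the fold alongside the best inner cross-validation score and the value of C that produced it.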
In addition to the scores for the 10 datasets, we find a couple more attributes for our optimized model, which are the best score from our cross-validation and the best value of C found. Recall that in Scikit-Learn, attributes an estimator learns during fitting are written with a trailing underscore. For example, a Logistic Regression estimator
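To illustrate the trailing-underscore convention, here is a small sketch using LogisticRegression. The attributes coef_ and classes_ are picked as examples and may not be the ones the original text went on to mention; dividing the pixel values by 16 is an added scaling step to help the solver converge.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

digits = datasets.load_digits()
# Scale pixel intensities (0 to 16) into [0, 1]; an assumption for stable fitting
x, y = digits.data[:1000] / 16.0, digits.target[:1000]

logreg = LogisticRegression(max_iter=5000)

# Before fitting, learned attributes do not exist yet
fitted_before = hasattr(logreg, "coef_")

logreg.fit(x, y)

# After fitting, trailing-underscore attributes hold what was learned
print(fitted_before, logreg.coef_.shape, logreg.classes_)
```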
Finally, as before, we can run
