1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture x K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data. Often model parameters are estimated using an optimization algorithm, which is a type of efficient search through possible parameter values. Some examples of model parameters include [1]:
The weights in an artificial neural network.
The support vectors in a support vector machine.
The coefficients in a linear regression or logistic regression.
A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data. We cannot know the best value for a model hyperparameter on a given problem. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error. Other possible search strategies are a grid search or a random search; in that case you are tuning the hyperparameters of the model in order to discover the parameters of the model that result in the most skillful predictions. Some examples of model hyperparameters include [1] (a short sketch contrasting parameters and hyperparameters follows this list):
The learning rate for training a neural network.
The C and sigma hyperparameters for support vector machines.
The k in k-nearest neighbors.
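As a minimal illustration of the distinction, the sketch below (my own example, using a synthetic dataset rather than any dataset from this article) sets the hyperparameter C by hand before training and then reads the coefficients, which are parameters estimated from the data by the optimizer:
# minimal sketch: hyperparameters are chosen before training, parameters are learned from data
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=1)
# C is a hyperparameter: we choose it, it is not estimated from the data
model_demo = LogisticRegression(C=1.0, max_iter=1000)
# the coefficients and intercept are parameters: they are estimated during fit()
model_demo.fit(X_demo, y_demo)
print(model_demo.coef_, model_demo.intercept_)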
According to [2], logistic regression is one of the most popular algorithms employed in classification problems.
This motivates us to explore and tune the main hyperparameters of logistic regression, which are:
Solver: the algorithm to use in the optimization problem. The choices are {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’.
Penalty (or regularization): intended to reduce the model's generalization error by discouraging overfitting. Regularization penalizes more complex models, so as to avoid the risk of overfitting. The choices are: {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’.
C (or inverse of regularization strength): must be a positive float. C works together with the penalty to regulate overfitting: smaller values specify stronger regularization, while a high value tells the model to give high weight to the training data.
For further details about these hyperparameters please see [2].
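As a quick illustration of where these hyperparameters appear (a sketch of my own, not tied to any particular dataset), all three are simply constructor arguments of scikit-learn's LogisticRegression:
# the three hyperparameters discussed above are set at construction time
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear',  # optimization algorithm
                        penalty='l2',        # regularization type
                        C=1.0)               # inverse of regularization strength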
Since logistic regression has a range of hyperparameters that may interact in nonlinear ways, it is often necessary to search for the set of hyperparameters that results in the best performance of the model on a dataset. This is called hyperparameter optimization, hyperparameter tuning, or hyperparameter search [3].
An optimization procedure involves defining a search space. This can be thought of geometrically as an n-dimensional volume, where each hyperparameter represents a different dimension and the scale of the dimension is the values that the hyperparameter may take on, such as real-valued, integer-valued, or categorical.
A point in the search space is a vector with a specific value for each hyperparameter value. The goal of the optimization procedure is to find a vector that results in the best performance of the model after learning, such as maximum accuracy or minimum error.
A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search:
Random Search. Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.
Grid Search. Define a search space as a grid of hyperparameter values and evaluate every position in the grid.
Grid search is great for spot-checking combinations that are known to perform well generally. Random search is great for discovering and getting hyperparameter combinations that you would not have guessed intuitively, although it often requires more time to execute.
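To make the difference concrete, here is a small sketch of my own (the hyperparameter names and values are illustrative only) that enumerates the candidate points each strategy would evaluate, using scikit-learn's ParameterGrid and ParameterSampler helpers:
# sketch: which points in the search space do grid search and random search visit?
from scipy.stats import loguniform
from sklearn.model_selection import ParameterGrid, ParameterSampler
# grid search evaluates every combination on a fixed grid
grid = ParameterGrid({'C': [0.01, 0.1, 1, 10], 'solver': ['lbfgs', 'liblinear']})
print(len(list(grid)))  # 4 x 2 = 8 candidate points
# random search samples points from distributions over the same space
sampler = ParameterSampler({'C': loguniform(1e-5, 100), 'solver': ['lbfgs', 'liblinear']},
                           n_iter=8, random_state=1)
for point in sampler:
    print(point)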
More advanced methods are sometimes used, such as Bayesian Optimization and Evolutionary Optimization.
The scikit-learn Python open-source machine learning library provides techniques to tune model hyperparameters. Specifically, it provides the RandomizedSearchCV for random search and GridSearchCV for grid search. Both techniques evaluate models for a given hyperparameter vector using cross-validation, hence the “CV” suffix of each class name.
Both classes require two arguments. The first is the model that you are optimizing, which in our case will be Logistic regression. This is an instance of the model with values of hyperparameters set that you want to optimize. The second is the search space. This is defined as a dictionary where the names are the hyperparameter arguments to the model and the values are discrete values or a distribution of values to sample in the case of a random search.
To better explain how to set these model hyperparameters, the sonar dataset will be employed again. The sonar dataset is a standard machine learning dataset comprising 208 rows of data with 60 numerical input variables and a target variable with two class values, i.e. binary classification. The dataset involves predicting whether sonar returns indicate a rock or a simulated mine. The next code:
Downloads the dataset and summarizes its shape.
Splits the dataset (208 rows, 60 input variables, one target column) into input and output elements.
# random search logistic regression model on the sonar dataset
from scipy.stats import loguniform
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
X, y
(array([[0.02, 0.0371, 0.0428, ..., 0.0084, 0.009, 0.0032],
        [0.0453, 0.0523, 0.0843, ..., 0.0049, 0.0052, 0.0044],
        [0.0262, 0.0582, 0.1099, ..., 0.0164, 0.0095, 0.0078],
        ...,
        [0.0522, 0.0437, 0.018, ..., 0.0138, 0.0077, 0.0031],
        [0.0303, 0.0353, 0.049, ..., 0.0079, 0.0036, 0.0048],
        [0.026, 0.0363, 0.0136, ..., 0.0036, 0.0061, 0.0115]], dtype=object),
 array(['R', 'R', 'R', ..., 'M', 'M', 'M'], dtype=object))
The next code will:
Define the model that will be optimized.
Define the search space through a dictionary where names are arguments to the model and values are distributions from which to draw samples.
The hyperparameters that will be optimized are: the solver, the penalty, and the C.
# define model
model = LogisticRegression()
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = loguniform(1e-5, 100)
Both the GridSearchCV and RandomizedSearchCV classes require a “cv” argument, which accepts either an integer number of folds, e.g. 5, or a configured cross-validation object. It is recommended to define and pass a cross-validation object in order to gain more control over model evaluation and to make the evaluation procedure obvious and explicit.
In the case of classification tasks, I recommend using the RepeatedStratifiedKFold class, and for regression tasks, I recommend using the RepeatedKFold with an appropriate number of folds and repeats, such as 10 folds and three repeats.
Both hyperparameter optimization classes also provide a “scoring” argument that takes a string indicating the metric to optimize. The metric must be maximized, meaning better models result in larger scores. For classification, this may be ‘accuracy‘. For regression, this is a negative error measure, such as ‘neg_mean_absolute_error‘ for a negative version of the mean absolute error, where values closer to zero represent less prediction error by the model.
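For completeness, here is a hedged sketch of my own of what the regression counterpart could look like, combining the RepeatedKFold class with the ‘neg_mean_absolute_error‘ metric on a synthetic dataset (Ridge regression and its 'alpha' hyperparameter are illustrative choices, not part of the original example):
# sketch: random search for a regression task (synthetic data, illustrative model)
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold
X_reg, y_reg = make_regression(n_samples=200, n_features=20, noise=0.5, random_state=1)
reg_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
reg_space = {'alpha': loguniform(1e-5, 100)}
reg_search = RandomizedSearchCV(Ridge(), reg_space, n_iter=100,
                                scoring='neg_mean_absolute_error',
                                n_jobs=-1, cv=reg_cv, random_state=1)
reg_result = reg_search.fit(X_reg, y_reg)
print(reg_result.best_score_, reg_result.best_params_)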
Finally, the search can be made parallel, e.g. use all of the CPU cores, by specifying the “n_jobs” argument as an integer with the number of cores in your system, e.g. 8, or as -1 to automatically use all of the cores in your system.
Once defined, the search is performed by calling the fit() function and providing a dataset used to train and evaluate model hyperparameter combinations using cross-validation. At the end of the search, you can access all of the results via attributes on the class. Perhaps the most important attributes are the best score observed and the hyperparameters that achieved the best score.
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
search = RandomizedSearchCV(model, space, n_iter=500, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
Best Score: 0.7897619047619049
Best Hyperparameters: {'C': 4.878363034905761, 'penalty': 'l2', 'solver': 'newton-cg'}
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning:
7080 fits failed out of a total of 15000.
This FitFailedWarning is expected here: the search space combines every solver with every penalty, and scikit-learn rejects invalid combinations (for example, ‘l1’ or ‘elasticnet’ with the ‘newton-cg’ or ‘lbfgs’ solvers). The failed candidates receive no score and are simply ignored when selecting the best hyperparameters.
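One way to avoid these failed fits, and at the same time illustrate the GridSearchCV class, is to restrict the search to valid solver/penalty combinations by passing a list of dictionaries. The following is a sketch of my own, reusing the model and cv objects defined above; the grid values are illustrative:
# grid search counterpart, restricted to solver/penalty combinations scikit-learn accepts
from sklearn.model_selection import GridSearchCV
grid_space = [
    {'solver': ['newton-cg', 'lbfgs'], 'penalty': ['l2'], 'C': [0.01, 0.1, 1, 10, 100]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]},
]
grid_search = GridSearchCV(model, grid_space, scoring='accuracy', n_jobs=-1, cv=cv)
grid_result = grid_search.fit(X, y)
print('Best Score: %s' % grid_result.best_score_)
print('Best Hyperparameters: %s' % grid_result.best_params_)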
The references provide a comparison between Grid Search and Random Search [3, 4, 5] which is summarized in the following figure [5].
The references [4, 5] also provide descriptions of alternative methods such as:
Informed search,
Genetic algorithms,
Bayesian optimization,
Tree-structured Parzen estimators (TPE); a minimal sketch of this last method is given after the list.
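The following is a minimal sketch of my own using the Optuna library (an extra dependency, not used in the original article), whose default sampler is a Tree-structured Parzen estimator; it tunes the same logistic regression on the sonar data defined above, with the penalty fixed to ‘l2’ for simplicity:
# minimal TPE sketch with Optuna (illustrative only)
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
def objective(trial):
    solver = trial.suggest_categorical('solver', ['newton-cg', 'lbfgs', 'liblinear'])
    C = trial.suggest_float('C', 1e-5, 100, log=True)
    clf = LogisticRegression(solver=solver, penalty='l2', C=C, max_iter=1000)
    # reuse the X, y arrays and the RepeatedStratifiedKFold object cv defined above
    return cross_val_score(clf, X, y, cv=cv, scoring='accuracy', n_jobs=-1).mean()
study = optuna.create_study(direction='maximize')  # TPE sampler is the default
study.optimize(objective, n_trials=50)
print(study.best_value, study.best_params)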
A summary of all steps necessary from data processing to model hyperparameter selection is given in the next figure [6].
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1M_DizcF662ccOfKQY_Tt4U-Hx3RTO13s?usp=sharing
[1] https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/
[2] https://medium.com/codex/do-i-need-to-tune-logistic-regression-hyperparameters-1cb2b81fca69
[3] https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/
[4] https://towardsdatascience.com/hyperparameter-tuning-in-python-21a76794a1f7
[5] https://neptune.ai/blog/hyperparameter-tuning-in-python-complete-guide