AutoML

2/8/20

Automated machine learning

What is AutoML?

AutoML - short for Automated machine learning - is the process of automating the machine learning pipeline.

"Isn't machine learning automated anyway?"

Yes and no. Despite popular opinion, machine learning and AI (Artificial Intelligence) still require a lot of human input. The modelling aspect of machine learning is automated, as it involves finding patterns in the data. You tell the model "here is the data" and "here is the thing I want to predict (target)". The model will do many 'passes' of the data and find the parameters/weights that best predict the target from the data.
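To make the idea of 'passes' concrete, here is a minimal sketch that fits a one-feature linear model by gradient descent with NumPy. It is not any particular library's training loop; the toy data, learning rate and number of passes are made-up values for illustration.

```python
import numpy as np

# Toy data (made up for illustration): the target is roughly 3 * x + 2
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

weight, bias = 0.0, 0.0   # the parameters the model has to learn
learning_rate = 0.1

for _ in range(500):                      # each iteration is one 'pass' of the data
    error = weight * x + bias - y         # how far off the current predictions are
    # Step each parameter against the gradient of the mean squared error
    weight -= learning_rate * 2 * np.mean(error * x)
    bias -= learning_rate * 2 * np.mean(error)

print(f"weight={weight:.2f}, bias={bias:.2f}")
```

The human input is everything around this loop: choosing the data, the target, the model family and the settings. That is the part AutoML tries to automate.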

What AutoML tools are available?

A lot. And the number is only likely to increase. The major cloud providers (Google, Amazon and Microsoft) each have their own AutoML, which they advertise as a reason to use their cloud platform. H2O has AutoML. There are a few Python packages such as: tpot (Tree-based Pipeline Optimization Tool), auto-sklearn, auto-keras, and auto-pytorch. There are many more here.

What does AutoML do?

In a previous blog post I discussed the steps involved in a machine learning project. AutoML will generate features from your data, choose the model and optimize the model. The image below is taken from the tpot docs:

The part of a machine learning project that some AutoML tools will work on. Taken from the tpot docs: https://epistasislab.github.io/tpot/

How does AutoML work?*

There are a couple of different approaches. One is the lazy brute-force approach of trying every machine learning model, as is done by the humorously named HungaBunga package. However, most AutoML tools are smarter than this. Some AutoML tools will do model selection, hyperparameter tuning and ensembling. For the AutoML to be most successful it will 'remember' which models, hyperparameters and ensembles are giving the best results. It does this using either Bayesian methods (as is done in hyperopt [1]) or a bandit approach (as is done in hyperband [2]). Some AutoML tools go a stage further and automate feature engineering. A simple way to do this is to select features based on the feature importance of a test model. More advanced AutoML tools use genetic programming, which is inspired by biology. Randomly shuffled features are evaluated based on how much they improve the model. As in survival of the fittest, only the most important features are kept. The details of how this is implemented in tpot are given in Olson et al. (2016) and Le et al. (2020) [3, 4].

* A good resource I used for this was this blog by Bojan Tunguz (data scientist, Nvidia)
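The brute-force approach described in this section can be sketched in a few lines with scikit-learn. This is only an illustration, not how HungaBunga or any AutoML tool is implemented: the three candidate models and the synthetic data are assumptions chosen to keep it small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank marketing dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A hypothetical, very short list of candidates; real tools try many more
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score every candidate with 5-fold cross-validated ROC AUC and keep the best
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

The smarter tools differ in how they decide what to try next, rather than trying everything with equal effort.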

AutoML tools used in this blog post

I will apply AutoML to the bank marketing dataset to predict if a person will subscribe to a term deposit. I will use two different AutoML tools. The dataset was chosen as it is the tutorial dataset for AutoML Tables and there is an example of working with this dataset for tpot.

As briefly mentioned above, the first AutoML tool I will use is tpot. tpot is open-source, integrates well with the PyData stack and has been around since 2015. It can also export the pipeline of the best model fit, which is useful for interpretability and for improving the model further.

Secondly, I will use Google's cloud-based tool, which is in beta: AutoML Tables (GCP was chosen as I still have free credits from my last blog post). One advantage of the cloud tool is that you can put the model into production at the click of a button. Unfortunately, it is not open source, so it is a black box and it is not possible to reproduce the results outside of GCP.

The results of these AutoML tools can be compared to a typical machine learning workflow such as this kaggle notebook by Janio Martinez. However, I'm only focusing on using the AutoML tools in this post.

Obtaining the data

You can download the data from the UCI website:

$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip

$ unzip bank.zip

Preprocessing the data

Unfortunately, AutoML is not quite there in terms of preprocessing, as doing it effectively often requires some domain knowledge. In this case I'll do the preprocessing and export the file to be used for machine learning.

Read in the data, rename the response variable to 'class' and convert it to binary:

import pandas as pd

bank = pd.read_csv('bank-full.csv', delimiter=';')

bank.rename(columns={'y': 'class'}, inplace=True)

bank['class'] = bank['class'].map({'no':0, 'yes':1})
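One thing worth checking after a map() like this is that every label was covered, because values missing from the dictionary silently become NaN. A small sketch on a made-up column (the 'maybe' label is hypothetical and only there to trigger the problem):

```python
import pandas as pd

# Toy stand-in for the response column; 'maybe' is a deliberately unexpected label
toy = pd.DataFrame({'class': ['no', 'yes', 'no', 'maybe']})
toy['class'] = toy['class'].map({'no': 0, 'yes': 1})

# Count the rows whose label was not in the mapping
n_unmapped = int(toy['class'].isna().sum())
print(n_unmapped)
```

If the count is zero, the column is safe to use as a binary target.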

Drop features not used:

bank.drop(['marital', 'default', 'housing', 'loan', 'contact', 'poutcome', 'day'], axis=1, inplace=True)

Encode: job, education and month using the MultiLabelBinarizer and do some housekeeping for the column names:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

jobtrans = mlb.fit_transform([{str(val)} for val in bank['job'].values])

# mlb.classes_ lists the labels in the same (sorted) order as the output columns;

# bank['job'].unique() is in order of appearance and would mislabel them

jobtrans_df = pd.DataFrame(jobtrans, columns=mlb.classes_)

# regex=False so that '.' is treated as a literal dot

jobtrans_df.columns = jobtrans_df.columns.str.replace('-', '', regex=False)

jobtrans_df.columns = jobtrans_df.columns.str.replace('.', '', regex=False)

jobtrans_df.rename(columns={'unknown': "unknown_job"}, inplace=True)

educationtrans = mlb.fit_transform([{str(val)} for val in bank['education'].values])

educationtrans_df = pd.DataFrame(educationtrans, columns=mlb.classes_)

educationtrans_df.rename(columns={'unknown': "unknown_education"}, inplace=True)

monthtrans = mlb.fit_transform([{str(val)} for val in bank['month'].values])

month_df = pd.DataFrame(monthtrans, columns=mlb.classes_)

bank.drop(['job', 'education', 'month'], axis=1, inplace=True)

bank = bank.merge(jobtrans_df, left_index=True, right_index=True)

bank = bank.merge(educationtrans_df, left_index=True, right_index=True)

bank = bank.merge(month_df, left_index=True, right_index=True)
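An alternative sketch for this encoding step is pandas' get_dummies, which builds one indicator column per category and prefixes each with the source column name. The toy rows below are made up, and this is not the code used for the results in this post.

```python
import pandas as pd

# Toy rows standing in for the bank data (values made up for illustration)
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'job': ['admin.', 'blue-collar', 'unknown'],
    'education': ['primary', 'unknown', 'tertiary'],
})

# One 0/1 column per category; the prefix keeps 'unknown' in job and
# 'unknown' in education from colliding
encoded = pd.get_dummies(toy, columns=['job', 'education'], dtype=int)
print(encoded.columns.tolist())
```

The column names come out differently from the ones above (for example job_unknown rather than unknown_job), so any later code that refers to columns by name would need adjusting.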

Save the file for machine learning:

bank.to_csv('bank_for_ml.csv', index=False)  # index=False keeps the row index out of the features

AutoML with TPOT

Installing

$ conda create -n tpot python=3.7

$ conda activate tpot

$ conda install -c conda-forge tpot py-xgboost dask dask-ml scikit-mdr skrebate

Note: Nicholas Bollweg is kindly working on tpot-full for installing tpot with additional dependencies.

Modelling

Import packages and set up the Dask client:

from tpot import TPOTClassifier

import pandas as pd

from dask.distributed import Client

client = Client()

Read in the data:

df = pd.read_csv('bank_for_ml.csv')

y = df['class'].values

X = df.drop('class', axis=1).values

Split the data into training and testing:

from sklearn.model_selection import train_test_split

training_indices, testing_indices = train_test_split(df.index, stratify=y, test_size=0.2)
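The stratify argument keeps the class balance the same in both splits, which matters when one class is rare. A small sketch on made-up labels (the 10% positive rate is an assumption for illustration, not the bank dataset's actual rate):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 100 positives and 900 negatives
labels = np.array([1] * 100 + [0] * 900)
indices = np.arange(len(labels))

train_idx, test_idx = train_test_split(
    indices, stratify=labels, test_size=0.2, random_state=0)

# Both splits keep the same share of positives as the full data
train_rate = labels[train_idx].mean()
test_rate = labels[test_idx].mean()
print(train_rate, test_rate)
```

Passing random_state makes the split repeatable, which is handy when comparing runs.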

Set up the TPOTClassifier and change the default values for:

generations to None. This will ensure max_time_mins determines how long to run the AutoML for.
scoring to roc_auc. This is the default metric for AutoML Tables. This parameter tells the AutoML to maximize the area under the receiver operating characteristic curve (ROC AUC). You can also choose other metrics such as precision or recall depending on the problem.
n_jobs to -1 to use all the cores.
max_time_mins to 1440 minutes (60 * 24). This tells the AutoML how long to run for.
max_eval_time_mins to 180 minutes (60 * 3). This tells the AutoML how long to evaluate each pipeline. It can be reduced to save time evaluating complex pipelines.
use_dask to True. This uses Dask-ML's pipeline optimizations. This avoids re-fitting the same estimator on the same split of data multiple times.
verbosity to 2. This prints out the score at each generation and a progress bar.

Other parameters are explained in the appendix at the end of the blog.
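Since scoring is set to roc_auc, it may help to see what that number measures. A minimal sketch with four made-up predictions (the labels and scores are illustrative, not from the bank data):

```python
from sklearn.metrics import roc_auc_score

# Four made-up examples: true labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# The same number by hand: the fraction of (positive, negative) pairs
# in which the positive example gets the higher score
pairs = [(p, n)
         for p, t in zip(y_score, y_true) if t == 1
         for n, u in zip(y_score, y_true) if u == 0]
manual_auc = sum(p > n for p, n in pairs) / len(pairs)

print(auc, manual_auc)  # both 0.75: 3 of the 4 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.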

tpot = TPOTClassifier(generations=None,

                      population_size=100,

                      offspring_size=None,

                      mutation_rate=0.9,

                      crossover_rate=0.1,

                      scoring='roc_auc',

                      cv=5,

                      subsample=1.0,

                      n_jobs=-1,

                      max_time_mins=1440,

                      max_eval_time_mins=180,

                      random_state=None,

                      config_dict=None,

                      template=None,

                      warm_start=False,

                      memory=None,

                      use_dask=True,

                      periodic_checkpoint_folder=None,

                      early_stop=None,

                      verbosity=2,

                      disable_update_check=False)

Fit on the data:

tpot.fit(X[training_indices], y[training_indices])

Parallel model fits by tpot as seen in the Dask dashboard.

Score the holdout data:

tpot.score(X[testing_indices], y[testing_indices])

Export the best pipeline:

tpot.export('tpot_bank_pipeline.py')

Load the pipeline to avoid running the AutoML again:

%load tpot_bank_pipeline.py

Here it shows:

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.pipeline import make_pipeline

from sklearn.preprocessing import MaxAbsScaler, StandardScaler

from xgboost import XGBClassifier

# NOTE: Make sure that the outcome column is labeled 'target' in the data file

tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)

features = tpot_data.drop('target', axis=1)

training_features, testing_features, training_target, testing_target = \

            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.9442117735375486

exported_pipeline = make_pipeline(

    StandardScaler(),

    MaxAbsScaler(),

    XGBClassifier(learning_rate=0.1, max_depth=6, min_child_weight=7, n_estimators=100, nthread=1, subsample=0.9500000000000001)

)

exported_pipeline.fit(training_features, training_target)

results = exported_pipeline.predict(testing_features)

tpot gave an AUC of: 0.94
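Note that the exported script ends with predict(), which returns hard 0/1 labels. To compute an AUC on holdout rows, one option is to use predict_proba instead. A hedged sketch on synthetic data, with scikit-learn's GradientBoostingClassifier standing in for XGBClassifier (an assumption made only so the example runs without xgboost):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Synthetic stand-in data (not the bank marketing dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

# Same shape as the exported pipeline; the final estimator is swapped
# so the sketch needs nothing beyond scikit-learn
pipeline = make_pipeline(
    StandardScaler(),
    MaxAbsScaler(),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=6,
                               n_estimators=100, random_state=0),
)
pipeline.fit(X_train, y_train)

# ROC AUC needs a score per row, so take the positive-class probability
# rather than the hard labels returned by predict()
holdout_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(holdout_auc)
```

The number printed here says nothing about the bank data; it only shows the mechanics of scoring a fitted pipeline.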

AutoML with AutoML Tables

In GCP create a project and enable the AutoML Tables API.
Choose the data cloud-ml-tables-data/bank-marketing.csv from cloud storage.
Choose the deposit column as the Target.
Click Train model (it does an 80-20 split at random and this is pretty much hard coded).
Choose 24 node hours (same run time as the tpot model).
Remove the features similar to the tpot model: MaritalStatus, Default, Housing, Loan, Contact, POutcome, Day.
Turn off Early stopping.
Click Train Model.

Output of the AutoML Tables model.

AutoML Tables gave an AUC of: 0.9

Thoughts on AutoML

Pros

It can open up machine learning to people who cannot code. For example, if a team does not have data scientist resources and wants to test solving a problem using machine learning it is worth trying.
It can be used as a tool to help data scientists think about new techniques to incorporate into their modelling.
It is an option to save time on the model selection and hyper-parameter tuning part of the machine learning workflow.

Cons

It can lead to lazy practices. It is always recommended to undertake exploratory data analysis, preprocess the data and test simple models against a baseline.
For AutoML to be effective it has to run for a long time (at least ~24 hours), so it can be very expensive. DriverlessAI often requires access to multiple GPUs and days of training time.
It can be difficult to interpret the results.
It is the opposite of fast.ai which is a community that aspires to make AI open to all with minimal resources available.

References

[1] Bergstra, J., Yamins, D. and Cox, D. D. (2013) Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. Proc. of the 30th International Conference on Machine Learning (ICML 2013). http://proceedings.mlr.press/v28/bergstra13.pdf

[2] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. and Talwalkar, A. (2016) Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. https://arxiv.org/abs/1603.06560

[3] Olson, R. S., Bartley, N., Urbanowicz, R. J. and Moore, J. H. (2016) Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of the Genetic and Evolutionary Computation Conference. https://dl.acm.org/doi/10.1145/2908812.2908918

[4] Le, T. T., Fu, W. and Moore, J. H. (2020) Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, 36 (1), p250–256. https://academic.oup.com/bioinformatics/article/36/1/250/5511404

Footnotes

https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9

https://info.cnvrg.io/auto-adaptive-machine-learning

https://cloud.google.com/automl-tables/

https://ai.googleblog.com/2017/05/using-machine-learning-to-explore.html

Appendix: Tpot Classifier Model Parameters

A detailed list of the model parameters can be found here. Below I will summarize the model parameters:

generations - Number of iterations for the pipeline optimization process.
population_size - Number of individuals to retain every generation.
offspring_size - Number of offspring to produce in each generation (often same as population_size).
mutation_rate - The rate (between 0 and 1) at which random changes are applied to pipelines every generation.
crossover_rate - The rate (between 0 and 1) at which pipelines are "bred" every generation.
scoring - The scoring metric to optimize.
cv - Number of cross-validation folds.
subsample - Fraction of training samples that are used.
n_jobs - Number of cores to use.
max_time_mins - How many minutes to optimize the pipeline.
max_eval_time_mins - How many minutes to evaluate a single pipeline.
random_state - The seed of a random number generator to ensure tpot gives the same results.
config_dict - A configuration dictionary for customizing operators and parameters. e.g. 'TPOT sparse'.
template - Template of predefined pipeline structure to reduce computation time and provide more interpretable results.
warm_start - Reuse the results from previous calls to fit().
memory - Cache each transformer after calling fit.
use_dask - Use Dask-ML's pipeline optimizations. This avoids re-fitting the same estimator on the same split of data multiple times.
periodic_checkpoint_folder - Folder to keep checkpoints of the pipeline in case the run crashes.
early_stop - Number of generations without improvement after which the optimization stops early.
verbosity - How much information to display when running.
disable_update_check - Whether to disable the check for a new version of tpot.
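To give a feel for how population_size, mutation_rate and crossover_rate interact, here is a toy evolutionary search over feature subsets. It is a simplified illustration, not tpot's implementation: the data is synthetic, individuals are plain boolean feature masks rather than whole pipelines, and the selection rule (keep the top scorers) is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 4 informative features hidden among 12
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)

population_size, generations = 8, 5        # hypothetical toy settings
mutation_rate, crossover_rate = 0.9, 0.1   # same values as in the post

def fitness(mask):
    """Cross-validated ROC AUC using only the features selected by mask."""
    if not mask.any():
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask], y, cv=3, scoring="roc_auc").mean()

# Each individual is a boolean mask over the feature columns
population = [rng.random(X.shape[1]) < 0.5 for _ in range(population_size)]

for generation in range(generations):
    offspring = []
    for _ in range(population_size):
        parent = population[rng.integers(population_size)].copy()
        roll = rng.random()
        if roll < crossover_rate:
            # "Breed": take each feature flag from one of two parents
            other = population[rng.integers(population_size)]
            child = np.where(rng.random(parent.size) < 0.5, parent, other)
        elif roll < crossover_rate + mutation_rate:
            # Mutate: flip one randomly chosen feature flag
            child = parent
            flip = rng.integers(child.size)
            child[flip] = not child[flip]
        else:
            child = parent                 # passed on unchanged
        offspring.append(child)
    # Survival of the fittest: keep the best population_size individuals
    ranked = sorted(population + offspring, key=fitness, reverse=True)
    population = ranked[:population_size]

best = population[0]
best_score = fitness(best)
print(np.flatnonzero(best), round(best_score, 3))
```

Because parents compete with their offspring for a place in the next generation, the best score can never get worse from one generation to the next in this sketch.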
