Image credits: Image Classification by Oleksandr Panasovskyi from the Noun Project; scikit-learn logo, scikit-learn
"I wish my machine-learning model performed better" (said everyone everywhere). I found myself in that position not too long ago.
I had trained an NLP classification model to distinguish pieces of writing associated with substance use from those associated with recovery from addiction. This is a trickier challenge than it sounds, because we are not looking at two completely separate communities but at one: relapse is a typical part of the recovery process, and substance users and people in recovery share more than 60% of their vocabulary (according to the statistics output by my model, at any rate).
The results I got from training six different models were fairly typical for such a basic machine-learning experiment: some models were (sometimes dramatically) overfit, a few were actually underfit, and two performed particularly well. But a 90%+ accuracy score still left me hoping for better. Was there some way, I idly wondered, to take the best from each of my models and combine them into a single, top-performing meta-estimator?
If you're reading this and you have more time in the machine-learning trenches than I do (which is quite likely unless you are in Grade 4, and even then, maybe!), you may be thinking to yourself, "Um, yeah Bozo, don't you know about stacked generalization?"
And you would, indeed, be right. I was not familiar with the concept. But since scikit-learn's 0.22 release introduced StackingRegressor and StackingClassifier, now I am! Stacked generalization is an ensemble machine-learning algorithm. I'm going to discuss it briefly here before showing a simple implementation example.
Machine learning expert Dr Jason Brownlee describes the concept and its benefits succinctly. According to him, stacked generalization "uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms." He adds:
"The benefit of stacking is that it can harness the capabilities of a range of well-performing models on a classification or regression task and make predictions that have better performance than any single model in the ensemble."
Sounds great, doesn't it? My colleagues and I have spent much time building pipelines that transform data and pass it on to a classifier or regressor in a repeatable way, but each of those models was still siloed. Stacking offers another way. The scikit-learn documentation does note, however, that the performance of a stacked set of models is not always superior:
"It is sometimes tedious to find the model which will best perform on a given dataset. Stacking provide an alternative by combining the outputs of several learners, without the need to choose a model specifically. The performance of stacking is usually close to the best model and sometimes it can outperform the prediction performance of each individual model. "
Sounds worth a try to me! What's the drawback? Using a stacked classifier or regressor is more computationally expensive, depending on the number of models you feed to the final estimator. In the case of my comparatively simple classification model (which involved about 8,000 observations), the whole procedure took perhaps five minutes.
Let's see what that meta-model configuration could look like. The syntax is pretty simple. First, a few imports relevant to my experiment. Note the StackingClassifier that forms part of the third import:
from pprint import pprint
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer, LabelEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
When you instantiate the StackingClassifier, it expects two things: first, a list of the estimators you want to learn from your training data; second, the final estimator, which takes as its input the predictions produced by those first-level estimators. Note that, in the code snippet below, the estimators selected (and their tuned hyperparameters) are specific to the needs of my experiment. You can pass your final estimator any set of models that makes sense for your data and use case. As mentioned above, for regression problems, you can use StackingRegressor.
estimators = [
    ('rf', RandomForestClassifier(
        min_samples_leaf=3,
        min_samples_split=4,
        n_estimators=262,
        n_jobs=-1,
        random_state=42)),
    ('dt', DecisionTreeClassifier(
        max_depth=7,
        min_samples_leaf=6,
        min_samples_split=7,
        random_state=42)),
    ('svc', SVC(C=10, gamma=0.1, kernel='rbf', random_state=42))
]
classifier = StackingClassifier(estimators=estimators, final_estimator=GaussianNB())
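As an aside, and relevant to the computational cost mentioned earlier, StackingClassifier takes a few optional parameters that are worth knowing about. The version below is just an expanded sketch of the one-liner above, spelling out the default values (only n_jobs is changed here to parallelize the fits), rather than a tuned configuration:
# An expanded sketch of the instantiation above, with the optional
# parameters written out. cv controls the internal cross-validation
# used to generate the base models' predictions for the final
# estimator, stack_method chooses predict_proba / decision_function /
# predict per base model, passthrough=True would also feed the original
# features to the final estimator, and n_jobs parallelizes the fits.
classifier = StackingClassifier(estimators=estimators,
                                final_estimator=GaussianNB(),
                                cv=5,
                                stack_method='auto',
                                passthrough=False,
                                n_jobs=-1)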
You know the rest of the drill. Set up your train/test split and fit the whole StackingClassifier against your data:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
classifier.fit(X_train, y_train)
How did it do?
# Compare the training and test accuracy scores
pprint(f'Score on training set: {classifier.score(X_train, y_train)}')
pprint(f'Score on testing set: {classifier.score(X_test, y_test)}')
'Score on training set: 0.902265659706797'
'Score on testing set: 0.7949400798934754'
In this specific case, not so great. The scores indicate that the final estimator was quite overfit to the training data; my original Gaussian Naive Bayes model had actually performed better on its own. That said, this was just a quick trial run of StackingClassifier for the purposes of this blog post. I'm sure that on the right project, or with some experimentation to find the right combination of first-level estimators, stacked generalization may just save the day for me someday soon!
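If you want to try that kind of experimentation yourself, here is a rough sketch of one way to go about it: cross-validate a few candidate line-ups of first-level estimators and compare their mean scores. The candidate groupings below are arbitrary examples built from the estimators list defined earlier, not recommendations:
# A rough sketch: compare candidate sets of first-level estimators
# by cross-validating each stacked classifier on the training data.
candidate_stacks = {
    'rf + dt + svc': estimators,
    'rf + svc': [est for est in estimators if est[0] != 'dt'],
    'dt + svc': [est for est in estimators if est[0] != 'rf'],
}

for name, ests in candidate_stacks.items():
    stack = StackingClassifier(estimators=ests, final_estimator=GaussianNB())
    scores = cross_val_score(stack, X_train, y_train, cv=5, n_jobs=-1)
    print(f'{name}: mean CV accuracy {scores.mean():.3f} (+/- {scores.std():.3f})')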