## Introduction

Many analysts misinterpret the term ‘boosting’ used in data science. Let me provide an interesting explanation of this term. Boosting grants power to machine learning models to improve their prediction accuracy. Boosting algorithms are among the most widely used algorithms in data science competitions, and the winners of our last hackathons agree that they try boosting algorithms to improve the accuracy of their models. In this article, I will explain how boosting algorithms work in a very simple manner. I’ve also shared the Python code below. I’ve skipped the intimidating mathematical derivations used in boosting, because they wouldn’t have allowed me to explain this concept in simple terms. Let’s get started.

## What is Boosting?

In simple terms, boosting combines the predictions of many weak learners to form a single strong rule.
Let’s understand this definition in detail by solving a problem of spam email identification: how would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify ‘spam’ and ‘not spam’ emails using the following criteria:

- Email has only one image file (promotional image): it’s SPAM
- Email has only link(s): it’s SPAM
- Email body consists of a sentence like “You won a prize money of $ xxxxxx”: it’s SPAM
- Email is from our official domain “Analyticsvidhya.com”: not SPAM
- Email is from a known source: not SPAM
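As an illustrative sketch, these criteria can be encoded as tiny rule functions and combined by a simple majority vote. All names, email fields, and sample values below are hypothetical, not part of any real library:

```python
# Illustrative sketch only: rule names, email fields, and values are hypothetical.
# Each criterion becomes a tiny function that votes SPAM (True) or not (False).

def only_promo_image(email):
    return email.get("only_image", False)

def only_links(email):
    return email.get("only_links", False)

def prize_text(email):
    return "you won a prize" in email.get("body", "").lower()

def not_official_domain(email):
    return not email.get("sender", "").endswith("analyticsvidhya.com")

def unknown_source(email):
    return not email.get("known_source", False)

RULES = [only_promo_image, only_links, prize_text,
         not_official_domain, unknown_source]

def classify(email):
    # Majority vote: SPAM wins if more than half of the rules vote SPAM.
    spam_votes = sum(rule(email) for rule in RULES)
    return "SPAM" if spam_votes > len(RULES) / 2 else "NOT SPAM"

suspicious = {"only_image": False, "only_links": True,
              "body": "You won a prize money of $10000",
              "sender": "promo@example.com", "known_source": False}
print(classify(suspicious))  # 4 of 5 rules vote SPAM
```

Each rule alone is a weak signal; the vote is what makes the combined decision useful.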
Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But do you think these rules individually are strong enough to successfully classify an email? No. Individually, these rules are not powerful enough to classify an email into ‘spam’ or ‘not spam’. Therefore, these rules are called **weak learners**.

To convert weak learners into a strong learner, we’ll combine the prediction of each weak learner using methods like:

- Using average / weighted average
- Considering the prediction that has the higher vote

For example: above, we have defined 5 weak learners. Out of these 5, 3 voted ‘SPAM’ and 2 voted ‘Not a SPAM’. In this case, by default, we’ll consider the email as SPAM because we have the higher (3) vote for ‘SPAM’.

## How Do Boosting Algorithms Work?

Now we know that boosting combines weak learners, a.k.a. base learners, to form a strong rule. An immediate question which should pop into your mind is, ‘How does boosting identify weak rules?’

To find a weak rule, we apply base learning (ML) algorithms with a different distribution. Each time a base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.

Here’s another question which might haunt you: ‘How do we choose a different distribution for each round?’ For choosing the right distribution, here are the steps:

**Step 1:** The base learner takes all the distributions and assigns equal weight to each observation.

**Step 2:** If there is any prediction error caused by the first base learning algorithm, we pay higher attention to the observations having prediction error. Then, we apply the next base learning algorithm.

**Step 3:** Iterate Step 2 till the limit of the base learning algorithm is reached or higher accuracy is achieved.
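The iterative reweighting just described can be sketched in code. This is a minimal, illustrative AdaBoost-style loop with hand-rolled decision stumps, written only to show the equal-weights / reweight-mistakes / repeat cycle; it is not the exact algorithm any library implements:

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustive search for the best weighted decision stump (a weak rule)."""
    best = None
    for j in range(X.shape[1]):                 # each feature
        for t in np.unique(X[:, j]):            # each candidate threshold
            for sign in (1, -1):                # each direction
                pred = np.where(X[:, j] >= t, sign, -sign)
                err = float(np.sum(w[pred != y]))
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def stump_predict(stump, X):
    _, j, t, sign = stump
    return np.where(X[:, j] >= t, sign, -sign)

def boost(X, y, n_rounds=10):
    """AdaBoost-style sketch; labels y are assumed to be -1/+1."""
    w = np.full(len(y), 1.0 / len(y))           # Step 1: equal weight everywhere
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)              # weak rule on current distribution
        err = stump[0] / w.sum()
        if err >= 0.5:                          # no better than chance: stop
            break
        err = max(err, 1e-10)                   # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # "say" of this weak rule
        ensemble.append((alpha, stump))
        pred = stump_predict(stump, X)
        w = w * np.exp(alpha * (pred != y))     # Step 2: focus on the mistakes
        w = w / w.sum()
        if err < 1e-9:                          # already perfect: stop
            break
    return ensemble

def predict(ensemble, X):
    # Strong rule: weighted vote of all the weak rules found so far.
    score = sum(alpha * stump_predict(s, X) for alpha, s in ensemble)
    return np.sign(score)
```

Each round fits the base learner on a reweighted distribution, so later weak rules concentrate on the examples earlier ones got wrong.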
Finally, it combines the outputs from the weak learners to create a strong learner, which eventually improves the prediction power of the model. Boosting pays higher focus on examples which are misclassified or have higher errors under the preceding weak rules.

## Types of Boosting Algorithms

The underlying engine used for boosting can be anything: a decision stump, a margin-maximizing classification algorithm, etc. There are many boosting algorithms which use different engines, such as:

- AdaBoost (**Ada**ptive **Boost**ing)
- Gradient Tree Boosting
- XGBoost
In this article, we will focus on AdaBoost and Gradient Boosting, followed by their respective Python code, and will cover XGBoost in an upcoming article.
## Boosting Algorithm: AdaBoost

[Figure: AdaBoost illustration]

This diagram aptly explains AdaBoost. Let’s understand it closely:
Mostly, we use decision stumps with AdaBoost. But we can use any machine learning algorithm as the base learner if it accepts weights on the training data set. We can use AdaBoost algorithms for both classification and regression problems. You can refer to the article “Getting smart with Machine Learning – AdaBoost” to understand AdaBoost algorithms in more detail.

## Python Code

```python
from sklearn.ensemble import AdaBoostClassifier  # for classification
from sklearn.ensemble import AdaBoostRegressor   # for regression
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
clf = AdaBoostClassifier(n_estimators=100, base_estimator=dt, learning_rate=1)
# Above, I have used a decision tree as the base estimator; you can use any ML
# learner as the base estimator if it accepts sample weights.
clf.fit(x_train, y_train)
```

You can tune the parameters to optimize the performance of the algorithm. I’ve mentioned the key tuning parameters below:

- **n_estimators**: controls the number of weak learners.
- **learning_rate**: controls the contribution of weak learners in the final combination. There is a trade-off between `learning_rate` and `n_estimators`.
- **base_estimator**: specifies the base ML algorithm. (In recent versions of scikit-learn, this parameter is named `estimator`.)
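To see the trade-off between `learning_rate` and `n_estimators` in practice, here is a hedged sketch on a synthetic dataset; the dataset and the parameter pairs are arbitrary illustrative choices, not recommendations:

```python
# Illustrative sketch of the learning_rate / n_estimators trade-off.
# Dataset and parameter values are arbitrary choices for demonstration.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A smaller learning_rate usually needs more estimators to reach similar accuracy.
for lr, n in [(1.0, 50), (0.5, 100), (0.1, 500)]:
    clf = AdaBoostClassifier(n_estimators=n, learning_rate=lr)
    clf.fit(x_train, y_train)
    print(f"learning_rate={lr}, n_estimators={n}, "
          f"test accuracy={clf.score(x_test, y_test):.3f}")
```

Comparing the three runs on held-out data is a quick way to pick a sensible pair before a fuller grid search.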
You can also tune the parameters of the base learner to optimize its performance.

## Boosting Algorithm: Gradient Boosting

In gradient boosting, we train many models sequentially. Each new model gradually minimizes the loss function (y = ax + b + e, where e needs special attention as it is an error term) of the whole system using the Gradient Descent method. The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable. The principal idea behind this algorithm is to construct new base learners which are maximally correlated with the negative gradient of the loss function associated with the whole ensemble. You can refer to the article “Learn Gradient Boosting Algorithm” to understand this concept using an example.

In the Python scikit-learn library, we use Gradient Tree Boosting, or GBRT. It is a generalization of boosting to arbitrary differentiable loss functions and can be used for both regression and classification problems.

## Python Code

```python
from sklearn.ensemble import GradientBoostingClassifier  # for classification
from sklearn.ensemble import GradientBoostingRegressor   # for regression

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
clf.fit(X_train, y_train)
```

- **n_estimators**: controls the number of weak learners.
- **learning_rate**: controls the contribution of weak learners in the final combination. There is a trade-off between `learning_rate` and `n_estimators`.
- **max_depth**: maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
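Because the models are trained sequentially, scikit-learn's `staged_predict` lets you watch the ensemble improve as weak learners are added, which is a handy way to choose `n_estimators`. The dataset and parameter values below are illustrative choices:

```python
# Sketch: track test error after each boosting stage with staged_predict.
# Dataset and parameter values are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=1)
clf.fit(X_train, y_train)

# Test error after each added weak learner; the minimum suggests n_estimators.
errors = [np.mean(stage != y_test) for stage in clf.staged_predict(X_test)]
best_n = int(np.argmin(errors)) + 1
print(f"best n_estimators: {best_n}, test error: {min(errors):.3f}")
```

If the error curve flattens or rises near the end, adding more estimators is unlikely to help at this learning rate.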
You can also tune the loss function for better performance.

## End Note

In this article, we looked at boosting, one of the ensemble modeling methods used to enhance prediction power. Here, we discussed the science behind boosting and two of its types: AdaBoost and Gradient Boosting. We also studied their respective Python code. In my next article, I will discuss another type of boosting algorithm which is nowadays the secret to winning data science competitions: XGBoost.

Did you find this article helpful? Please share your opinions / thoughts in the comments section below.