In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.^{[1]}^{[2]}^{[3]} Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

Overview

Supervised learning algorithms are commonly described as performing the task of searching through a hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem. Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner. The broader term multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner.

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model, so ensembles may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. Fast algorithms such as decision trees are commonly used in ensemble methods (for example, random forests), although slower algorithms can benefit from ensemble techniques as well. By analogy, ensemble techniques have also been used in unsupervised learning scenarios, for example in consensus clustering or in anomaly detection.

Ensemble theory

An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to overfit the training data more than a single model would, but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to overfitting of the training data.

Empirically, ensembles tend to yield better results when there is significant diversity among the models.^{[4]}^{[5]} Many ensemble methods, therefore, seek to promote diversity among the models they combine.^{[6]}^{[7]} Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).^{[8]} Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb down the models in order to promote diversity.^{[9]}

Common types of ensembles

Bayes optimal classifier

The Bayes Optimal Classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it.^{[10]} Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true.
To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes Optimal Classifier can be expressed with the following equation:

    y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)

where y is the predicted class, C is the set of all possible classes, H is the hypothesis space, P refers to a probability, and T is the training data. As an ensemble, the Bayes Optimal Classifier represents a hypothesis that is not necessarily in H. The hypothesis represented by the Bayes Optimal Classifier, however, is the optimal hypothesis in ensemble space (the space of all possible ensembles consisting only of hypotheses in H).

Unfortunately, the Bayes Optimal Classifier cannot be practically implemented for any but the most simple of problems. There are several reasons why the Bayes Optimal Classifier cannot be practically implemented:

- Most interesting hypothesis spaces are too large to iterate over, as required by the argmax.
- Many hypotheses yield only a predicted class rather than a probability for each class, as required by the term P(c_j | h_i).
- Computing an unbiased estimate of the probability of the training set given a hypothesis, P(T | h_i), is non-trivial.
- Estimating the prior probability for each hypothesis, P(h_i), is rarely feasible.
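Because the sum over H is intractable for realistic hypothesis spaces, the rule above is easiest to see on a toy problem. The following is a minimal sketch, not taken from the cited sources: it assumes a tiny, enumerable hypothesis space of hard-coded classifiers with invented values of P(T | h) P(h), and simply evaluates the weighted vote in the equation.

    # Toy illustration of the Bayes Optimal Classifier vote. The hypotheses and
    # their weights P(T|h) * P(h) are invented for illustration; real hypothesis
    # spaces are far too large to enumerate like this.
    hypotheses = [
        (lambda x: "spam" if x["caps_ratio"] > 0.5 else "ham", 0.20),
        (lambda x: "spam" if x["num_links"] > 3 else "ham", 0.35),
        (lambda x: "ham", 0.10),
    ]

    def bayes_optimal_predict(x, classes=("spam", "ham")):
        # argmax over classes of sum_h P(c|h) * P(T|h) * P(h); each toy hypothesis
        # is deterministic, so P(c|h) is 1 for its predicted class and 0 otherwise.
        scores = {c: 0.0 for c in classes}
        for predict, weight in hypotheses:
            scores[predict(x)] += weight
        return max(scores, key=scores.get)

    print(bayes_optimal_predict({"caps_ratio": 0.7, "num_links": 1}))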
Bootstrap aggregating (bagging)

Main article: Bootstrap aggregating

Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.^{[11]} An interesting application of bagging in unsupervised learning is provided here.^{[12]}^{[13]}

Boosting

Main article: Boosting (meta-algorithm)

Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to overfit the training data. By far the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results^{[citation needed]}.

Bayesian parameter averaging

Bayesian parameter averaging (BPA) is an ensemble technique that seeks to approximate the Bayes Optimal Classifier by sampling hypotheses from the hypothesis space and combining them using Bayes' law.^{[14]} Unlike the Bayes Optimal Classifier, Bayesian model averaging (BMA) can be practically implemented. Hypotheses are typically sampled using a Monte Carlo sampling technique such as MCMC. For example, Gibbs sampling may be used to draw hypotheses that are representative of the distribution P(T | H). It has been shown that under certain circumstances, when hypotheses are drawn in this manner and averaged according to Bayes' law, this technique has an expected error that is bounded to be at most twice the expected error of the Bayes Optimal Classifier.^{[15]} Despite the theoretical correctness of this technique, early work showed experimental results suggesting that the method promoted overfitting and performed worse compared to simpler ensemble techniques such as bagging;^{[16]} however, these conclusions appear to be based on a misunderstanding of the purpose of Bayesian model averaging vs. model combination.^{[17]} Additionally, there have been considerable advances in theory and practice of BMA. Recent rigorous proofs demonstrate the accuracy of BMA in variable selection and estimation in high-dimensional settings,^{[18]} and provide empirical evidence highlighting the role of sparsity-enforcing priors within BMA in alleviating overfitting.^{[19]}

Bayesian model combination

Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. The results from BMC have been shown to be better on average (with statistical significance) than BMA and bagging;^{[20]} a small sketch of the Dirichlet weighting scheme is given below.
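The following is a minimal sketch of the BMC weighting idea, not taken from the cited work: candidate ensemble weightings are drawn from a uniform Dirichlet distribution, scored by the likelihood the weighted mixture assigns to the training data, and averaged in proportion to that score. The models, data, and sample count are placeholders.

    import numpy as np

    def bmc_weights(per_model_probs, n_samples=2000, seed=0):
        """per_model_probs: array of shape (n_models, n_points) giving the probability
        each model assigns to the observed label of each training point.
        Returns an ensemble weighting obtained by sampling candidate weightings from a
        uniform Dirichlet and averaging them in proportion to the likelihood the
        weighted mixture assigns to the training data."""
        rng = np.random.default_rng(seed)
        n_models = per_model_probs.shape[0]
        candidates = rng.dirichlet(np.ones(n_models), size=n_samples)  # (n_samples, n_models)
        mix = candidates @ per_model_probs                             # mixture probability per point
        loglik = np.log(mix + 1e-12).sum(axis=1)                       # data log-likelihood per candidate
        post = np.exp(loglik - loglik.max())                           # unnormalized posterior over weightings
        post /= post.sum()
        return post @ candidates                                       # posterior-mean ensemble weighting

    # Toy usage: three models' per-point probabilities of the observed labels.
    probs = np.array([[0.9, 0.8, 0.6],
                      [0.7, 0.7, 0.7],
                      [0.5, 0.4, 0.9]])
    print(bmc_weights(probs))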
The use of Bayes' law to compute model weights necessitates computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model space, but such is rarely possible. Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data. It essentially reduces to an unnecessarily complex method for doing model selection.

The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution.

The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.

Bucket of models

A "bucket of models" is an ensemble in which a model selection algorithm is used to choose the best model for each problem. When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set.

The most common approach used for model selection is cross-validation selection (sometimes called a "bake-off contest"). It is described with the following pseudocode:

    For each model m in the bucket:
        Do c times (where 'c' is some constant):
            Randomly divide the training dataset into two datasets, A and B
            Train m with A
            Test m with B
    Select the model that obtains the highest average score

Cross-validation selection can be summed up as: "try them all with the training set, and pick the one that works best".^{[21]}

Gating is a generalization of cross-validation selection. It involves training another learning model to decide which of the models in the bucket is best suited to solve the problem. Often, a perceptron is used for the gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the predictions from each model in the bucket.

When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.^{[22]}
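As an illustration of the cross-validation selection pseudocode above, here is a minimal sketch, assuming scikit-learn-style estimators with fit/predict and accuracy as the score; the bucket of candidate models and the repetition count c are placeholders.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    def cross_validation_selection(bucket, X, y, c=10, seed=0):
        # "Try them all with the training set, and pick the one that works best."
        avg_scores = {}
        for name, model in bucket.items():
            scores = []
            for i in range(c):
                # Randomly divide the training data into two datasets, A and B
                X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.3, random_state=seed + i)
                model.fit(X_a, y_a)           # train m with A
                pred = model.predict(X_b)     # test m with B
                scores.append(accuracy_score(y_b, pred))
            avg_scores[name] = np.mean(scores)
        # Select the model that obtains the highest average score
        return max(avg_scores, key=avg_scores.get), avg_scores

    # Placeholder bucket of models; any estimators with fit/predict would do.
    bucket = {
        "tree": DecisionTreeClassifier(),
        "logreg": LogisticRegression(max_iter=1000),
        "nb": GaussianNB(),
    }
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    best, scores = cross_validation_selection(bucket, X, y)
    print(best, scores)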
Stacking

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data; then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although in practice, a single-layer logistic regression model is often used as the combiner.

Stacking typically yields performance better than any single one of the trained models.^{[23]} It has been successfully used on both supervised learning tasks (regression,^{[24]} classification and distance learning^{[25]}) and unsupervised learning (density estimation).^{[26]} It has also been used to estimate bagging's error rate.^{[3]}^{[27]} It has been reported to outperform Bayesian model averaging.^{[28]} The two top performers in the Netflix competition utilized blending, which may be considered to be a form of stacking.^{[29]}

Implementations in statistics packages
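As one illustration (not an exhaustive survey of packages), the Python library scikit-learn ships ready-made estimators for several of the ensemble types described above. The dataset and parameter choices below are placeholders chosen only to make the sketch self-contained.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                                  AdaBoostClassifier, StackingClassifier)

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        # Bagging: equal-weight vote over trees trained on bootstrap samples
        "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
        # Random forest: bagging of random decision trees
        "random forest": RandomForestClassifier(n_estimators=100),
        # Boosting: each new learner emphasizes previously misclassified instances
        "adaboost": AdaBoostClassifier(n_estimators=50),
        # Stacking: a logistic-regression combiner over the base learners' predictions
        "stacking": StackingClassifier(
            estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC(probability=True))],
            final_estimator=LogisticRegression(max_iter=1000)),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f}")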
