Tutorial Information

Ensemble methods are widely used by the machine learning community because they improve accuracy and robustness in many prediction problems of practical interest. Furthermore, they offer appealing solutions to several interesting learning problems, such as dealing with small or large datasets, performing data fusion, and modeling concept drift. Nevertheless, using ensemble methods also poses some challenges. Specifically, in many problems very large ensembles need to be generated to achieve good generalization performance. In addition, the ensemble prediction is typically generated by querying all the classifiers contained in the ensemble, a process that can be computationally expensive. In this tutorial we give a brief introduction to ensemble methods for machine learning and describe ensemble pruning techniques that address these shortcomings. We also introduce advanced research topics of current interest, such as the problem of determining the optimal ensemble size and techniques to make ensembles scale to large learning problems.

Ensemble Pruning

Ensemble pruning consists of selecting a subset of the original pool of ensemble predictors without deterioration (and sometimes with an improvement) of the prediction accuracy. These methods can be used to alleviate the large computational cost of ensembles. They can be divided into two groups: static and dynamic pruning techniques. Static techniques select a subset of classifiers from a given ensemble once, before deployment, and discard the rest. Dynamic pruning techniques also reduce the number of classifiers needed to obtain the final ensemble decision, but they do so separately for each test instance at classification time.
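The difference between the two groups can be illustrated with a minimal sketch. It is not the specific set of methods covered in the tutorial; it assumes greedy forward selection on a validation set as the static technique and early stopping of a majority vote as the dynamic one. The names static_prune, dynamic_vote, preds and y are hypothetical and chosen for this illustration only.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label; ties go to the smallest label."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

def accuracy(selected, preds, y):
    """Majority-vote accuracy of the sub-ensemble `selected` on validation data.

    preds[t][i] is the label predicted by classifier t for validation instance i.
    """
    hits = 0
    for i, label in enumerate(y):
        if majority_vote([preds[t][i] for t in selected]) == label:
            hits += 1
    return hits / len(y)

def static_prune(preds, y, size):
    """Static pruning (assumed variant): greedy forward selection.

    Grows the sub-ensemble one classifier at a time, always adding the
    classifier that gives the best validation accuracy; the rest are discarded.
    """
    selected, remaining = [], list(range(len(preds)))
    while remaining and len(selected) < size:
        best = max(remaining, key=lambda t: accuracy(selected + [t], preds, y))
        selected.append(best)
        remaining.remove(best)
    return selected

def dynamic_vote(classifiers, x):
    """Dynamic pruning (assumed variant): stop querying once the vote is decided.

    Classifiers are queried in order for the single instance x. Querying stops
    as soon as the leading class cannot be overtaken by the remaining votes.
    Returns the predicted label and the number of classifiers queried.
    """
    counts = Counter()
    total = len(classifiers)
    for queried, clf in enumerate(classifiers, start=1):
        counts[clf(x)] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead > total - queried:
            return ranked[0][0], queried
    # Only reached when the full vote ends in a tie.
    return majority_vote(list(counts.elements())), total
```

In this sketch the static variant pays its selection cost once and yields a fixed sub-ensemble, while the dynamic variant keeps the full ensemble and the number of classifiers queried varies from one test instance to the next.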

Parallel Ensembles

We will analyze in detail the behavior of parallel ensembles in the asymptotic regime, i.e. when the number of classifiers tends to infinity. Parallel ensembles are characterized by the fact that their classifiers are generated independently of one another, conditioned on the training data. In particular, we will show that the ensemble size can be estimated as the number of classifiers needed to obtain, with a predefined confidence level, the same predictions as an ensemble composed of infinitely many classifiers.
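The idea can be made concrete with a deliberately simplified sketch; it is not the estimation procedure presented in the tutorial. It assumes a binary problem, a single test instance, and a known probability p that a randomly generated classifier votes for class 1 on that instance. Under these assumptions the votes are independent draws, the infinite ensemble predicts class 1 when p > 0.5, and the probability that a majority vote of T classifiers agrees with it is a binomial tail. The names agreement_probability and required_size are hypothetical.

```python
from math import comb

def agreement_probability(T, p):
    """P(majority of T independent votes matches the infinite-ensemble label).

    Assumes T is odd, so that the majority vote cannot end in a tie.
    """
    q = max(p, 1.0 - p)  # probability that one vote matches the limiting label
    return sum(comb(T, k) * q**k * (1 - q)**(T - k)
               for k in range(T // 2 + 1, T + 1))

def required_size(p, confidence=0.99, max_size=999):
    """Smallest odd T whose majority vote agrees with the limit at `confidence`.

    Returns None if no size up to max_size suffices. The cap of 999 keeps the
    binomial coefficients within floating-point range in this naive version.
    """
    if p == 0.5:
        raise ValueError("the limiting prediction is undefined when p == 0.5")
    for T in range(1, max_size + 1, 2):
        if agreement_probability(T, p) >= confidence:
            return T
    return None
```

For example, required_size(0.9, 0.95) evaluates to 3 under these assumptions. Instances whose p is close to 0.5 require far more classifiers, which is why the required size depends on the problem and cannot be fixed in advance.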

Large-Scale Ensembles

We will present recent advances in ensemble methods for large-scale data (on the order of gigabytes) using Hadoop (http://hadoop.apache.org/), a framework for distributed processing of large data sets. We will also present future directions and trends in applying ensemble methods to large-scale hierarchical data.

Target Audience

The goal of the tutorial is to present recent advances in ensemble learning to researchers and graduate students with a general knowledge of machine learning. However, we expect to attract scientists from the whole spectrum of the machine learning community, and from related application areas such as natural language processing and image processing. Additionally, several areas of machine learning and data mining, such as multi-label learning and stream mining, pose interesting problems that can be handled by ensemble methods. Researchers working on these topics are also expected to be part of the audience.