2010 EnsembleMethodsInDataMining


Subject Headings: Supervised Ensemble Classification Algorithm.

Notes

Quotes

Abstract

Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges -- from investment timing to drug discovery, and fraud detection to recommendation systems -- where predictive accuracy is more vital than model interpretability. Ensembles are useful with all modeling algorithms, but this book focuses on decision trees to explain them most clearly. After describing trees and their strengths and weaknesses, the authors provide an overview of regularization -- today understood to be a key reason for the superior performance of modern ensembling algorithms. The book continues with a clear description of two recent developments: Importance Sampling (IS) and Rule Ensembles (RE). IS reveals classic ensemble methods -- bagging, random forests, and boosting -- to be special cases of a single algorithm, thereby showing how to improve their accuracy and speed. REs are linear rule models derived from decision tree ensembles. They are the most interpretable version of ensembles, which is essential to applications such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity.

1. Ensembles Discovered

1.2 Regularization

An influential paper was Tibshirani’s introduction of the Lasso regularization technique for linear models (Tibshirani, R., 1996). The Lasso uses the sum of the absolute values of the model coefficients as the penalty function, and it had roots in work done by Breiman on a coefficient post-processing technique he had termed the Garotte (Breiman et al., 1993).
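
As an illustrative sketch (not code from the book; data are simulated placeholders), a Lasso-penalized linear model can be fit in R with the glmnet package:

    ## Minimal Lasso sketch in R using the 'glmnet' package (simulated data).
    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)   # 20 candidate inputs
    y <- x[, 1] + 0.5 * x[, 2] + rnorm(200)               # only 2 truly matter

    fit   <- glmnet(x, y, alpha = 1)        # alpha = 1 selects the Lasso (L1) penalty
    cvfit <- cv.glmnet(x, y, alpha = 1)     # cross-validate the penalty strength lambda
    coef(cvfit, s = "lambda.min")           # sparse coefficient vector: many exact zeros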

Another important development came with the LARS algorithm by Efron et al. (2004), which allows for an efficient iterative calculation of the Lasso solution. More recently, Friedman published a technique called Path Seeker (PS) that allows combining the Lasso penalty with a variety of loss (error) functions (Friedman and Popescu, 2004), extending the original Lasso paper, which was limited to the least-squares loss.
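
For illustration only (an assumption, not the book's code; Friedman's Path Seeker is not distributed as a standard R package), the full Lasso path computed by the LARS algorithm is available in R through the lars package:

    ## Lasso coefficient path via the LARS algorithm ('lars' package, simulated data).
    library(lars)

    set.seed(1)
    x <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)

    path <- lars(x, y, type = "lasso")   # entire Lasso solution path in one pass
    plot(path)                           # coefficients as the L1 budget grows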

Careful comparison of the Lasso penalty with alternative penalty functions (e.g., using the sum of the squares of the coefficients) led to an understanding that the penalty function has two roles: controlling the “sparseness” of the solution (the number of coefficients that are non-zero) and controlling the magnitude of the non-zero coefficients (“shrinkage”). This led to the development of the Elastic Net family of penalty functions (Zou and Hastie, 2005), which allows searching for the best shrinkage/sparseness tradeoff according to characteristics of the problem at hand (e.g., data size, number of input variables, correlation among these variables). The Coordinate Descent algorithm of Friedman et al. (2008) provides fast solutions for the Elastic Net.
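
As a hedged sketch on simulated data (not from the book), the shrinkage/sparseness tradeoff can be explored in R with glmnet, whose solver implements the coordinate descent of Friedman et al.; the mixing parameter alpha moves between ridge (alpha = 0) and Lasso (alpha = 1):

    ## Elastic Net sketch: coordinate descent via 'glmnet' (simulated data).
    library(glmnet)

    set.seed(2)
    x <- matrix(rnorm(150 * 30), nrow = 150, ncol = 30)
    y <- rowSums(x[, 1:3]) + rnorm(150)

    enet <- glmnet(x, y, alpha = 0.5)      # mixed L1/L2 penalty: sparseness plus shrinkage
    cv   <- cv.glmnet(x, y, alpha = 0.5)   # pick lambda by cross-validation
    coef(cv, s = "lambda.1se")             # coefficients at a conservative lambda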

Finally, an extension of the Elastic Net family to non-convex members producing sparser solutions (desirable when the number of variables is much larger than the number of observations) is now possible with the Generalized Path Seeker algorithm (Friedman, J., 2008).
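
The Generalized Path Seeker itself is not available as a public R package; purely as an illustrative stand-in (a substitution, not Friedman's GPS), non-convex penalties with similarly sparse behavior can be fit with the ncvreg package:

    ## Non-convex (MCP) penalty sketch via 'ncvreg' -- a stand-in for GPS, not GPS itself.
    library(ncvreg)

    set.seed(3)
    x <- matrix(rnorm(50 * 200), nrow = 50, ncol = 200)   # many more variables than observations
    y <- x[, 1] - x[, 2] + rnorm(50)

    fit <- ncvreg(x, y, penalty = "MCP")   # non-convex penalty path; very sparse solutions
    plot(fit)                              # coefficient paths over lambda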

1.4 Organization of this Book

Chapter 2 presents the formal problem of predictive learning and details the most popular nonlinear method – decision trees, which are used throughout the book to illustrate concepts. Chapter 3 discusses model complexity and how regularizing complexity helps model selection. Regularization techniques play an essential role in modern ensembling. Chapters 4 and 5 are the heart of the book; there, the useful new concepts of Importance Sampling Learning Ensembles (ISLE) and Rule Ensembles – developed by J. Friedman and colleagues – are explained clearly. The ISLE framework allows us to view the classic ensemble methods of Bagging, Random Forest, AdaBoost, and Gradient Boosting as special cases of a single algorithm. This unified view clarifies the properties of these methods and suggests ways to improve their accuracy and speed. A Rule Ensemble is a new ISLE-based model built by combining simple, readable rules. While maintaining (and often improving) the accuracy of the classic tree ensemble, the rule-based model is much more interpretable. Chapter 5 also illustrates recently proposed interpretation statistics, which are applicable to Rule Ensembles as well as to most other ensemble types. Chapter 6 concludes by explaining why ensembles generalize much better than their apparent complexity would seem to allow. Throughout, snippets of code in R are provided to illustrate the algorithms described.
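
The following sketch (simulated data; not the book's own R snippets) fits two of the classic tree ensembles named above with standard R packages, as a point of reference for the ISLE discussion:

    ## Classic tree ensembles in R: a random forest and a gradient-boosted model (simulated data).
    library(randomForest)
    library(gbm)

    set.seed(4)
    n <- 300
    d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    d$y <- d$x1^2 + d$x2 + rnorm(n)

    rf  <- randomForest(y ~ ., data = d, ntree = 500)       # bagging plus random feature selection
    gbt <- gbm(y ~ ., data = d, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.01, interaction.depth = 3)  # gradient boosting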

2. Predictive Learning and Decision Trees

3. Model Complexity, Model Selection and Regularization

4. Importance Sampling and the Classic Ensemble Methods

5. Rule Ensembles and Interpretation Statistics

6. Ensemble Complexity

A. AdaBoost Equivalence to FSF Procedure

B. Gradient Boosting and Robust Loss Functions


References

John F. Elder, and Giovanni Seni (2010). "Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions." doi:10.2200/S00240ED1V01Y200912DMK002