2010 EnsemblebasedClassifiers

(Rokach, 2010) ⇒ Lior Rokach. (2010). “Ensemble-based Classifiers.” In: Artificial Intelligence Review Journal, 33(1-2). doi:10.1007/s10462-009-9124-7

Subject Headings: Ensemble Training.

Notes

Cited By

Quotes

Author Keywords

Boosting; Classification; Ensemble of classifiers; Supervised learning

Abstract

The idea of ensemble methodology is to build a predictive model by integrating multiple models. It is well-known that ensemble methods can be used for improving prediction performance. Researchers from various disciplines such as statistics and AI considered the use of ensemble methodology. This paper, review existing ensemble techniques and can be served as a tutorial for practitioners who are interested in building ensemble based systems.

The purpose of supervised learning is to classify patterns (also known as instances) into a set of categories which are also referred to as classes or labels. Commonly, the classification is based on a classification models (classifiers) that are induced from an exemplary set of preclassified patterns. Alternatively, the classification utilizes knowledge that is supplied by an expert in the application domain.

In a typical supervised learning setting, a set of instances, also referred to as a training set is given. The labels of the instances in the training set are known and the goal is to construct a model in order to label new instances. An algorithm which constucts the model is called inducer and an instance of an inducer for a specific training set is called a classifier.

The main idea behind the ensemble methodology is to weigh several individual classifiers, and combine them in order to obtain a classifier that outperforms every one of them. In fact, human being tends to seek several opinions before making any important decision. We weigh the individual opinions, and combine them to reach our final decision (Polikar 2006).

Marie Jean Antoine Nicolas de Caritat, marquis de Condorcet (1743.1794) was a French mathematician who among others wrote in 1785 the Essay on the Application of Analysis to the Probability of Majority Decisions. This work presented the well-known Condorcet.s jury theorem. The theorem refers to a jury of voters who need to make a decision regarding a binary outcome (for example to convict or not a defendant). If each voter has a probability p of being correct and the probability of a majority of voters being correct is L then: . p > 0.5 implies L > p . Also L approaches 1, for all p > 0.5 as the number of voters approaches infinity.

This theorem has two major limitations: the assumption that the votes are independent; and that there are only two possible outcomes. Nevertheless, if these two preconditions are met, then a correct decision can be obtained by simply combining the votes of a large enough jury that is composed of voters whose judgments are slightly better than a random vote.

Originally, the Condorcet Jury Theorem was written to provide a theoretical basis for democracy. Nonetheless, the same principle can be applied in supervised learning. A strong learner is an inducer that is given a training set consisting of labeled data and produces a classifier which can be arbitrarily accurate. A weak learner produces a classifier which is only slightly more accurate than random classification. The formal definitions of weak and strong learners are beyond the scope of this paper. The reader is referred to Schapire (1990) for these definitions under the PAC theory.

One of the basic question that has been investigated in ensemble learning is:. can a collection of weak classifiers create a single strong one ?.. Applying the Condorcet Jury Theorem insinuates that this goal might be achieved. Namely, construct an ensemble that (a) consists of independent classifiers, each of which correctly classifies a pattern with a probability of p > 0.5; and (b) has a probability of L > p to jointly classify a pattern to its correct class.

Sir Francis Galton (1822.1911) was an English philosopher and statistician that conceived the basic concept of standard deviation and correlation. While visiting a livestock fair, Galton was intrigued by a simple weight-guessing contest. The visitors were invited to guess the weight of an ox. Hundreds of people participated in this contest, but no one succeeded to guess the exact weight: 1, 198 pounds. Nevertheless, surprisingly enough, Galton found out that the average of all guesses came quite close to the exact weight: 1, 197 pounds. Similarly to the Condorcet jury theorem, Galton revealed the power of combining many simplistic predictions in order to obtain an accurate prediction.

James Michael Surowiecki, an American financial journalist, published in 2004 the book. The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations.. Surowiecki argues, that under certain controlled conditions, the aggregation of information from several sources, results in decisions that are often superior to those that could have been made by any single individual. even experts.

Naturally, not all crowds are wise (for example, greedy investors of a stock market bubble). Surowiecki indicates that in order to become wise, the crowd should comply with the following criteria:

Diversity of opinion. Each member should have private information even if it is just an eccentric interpretation of the known facts.
Independence.Members.

opinions are not determined by the opinions of those around them.

Decentralization. Members are able to specialize and draw conclusions based on local knowledge.
Aggregation. Some mechanism exists for turning private judgments into a collective decision.

The ensemble idea in supervised learning has been investigated since the late seventies. Tukey (1977) suggests combining two linear regression models. The first linear regression model is fitted to the original data and the second linear model to the residuals. Two years later, (Dasarathy and Sheela 1979) suggested to partition the input space using two or more classifiers. The main progress in the field was achieved during the Nineties. Hansen and Salamon (1990) suggested an ensemble of similarly configured neural networks to improve the predictive performance of a single one. At the same time (Schapire 1990) laid the foundations for the award winning AdaBoost (Freund and Schapire 1996) algorithm by showing that a strong classifier in the probably approximately correct (PAC) sense can be generated by combining. weak. classifiers (that is, simple classifiers whose classification performance is only slightly better than random classification). Ensemble methods can also be used for improving the quality and robustness of unsupervised tasks.

Ensemble methods can be also used for improving the quality and robustness of clustering algorithms (Dimitriadou et al. 2003). Nevertheless, in this paper we focus on classifier ensembles.

Given the potential usefulness of ensemble methods, it is not surprising that a vast number of methods are now available to researchers and practitioners. The aim of this paper is to provide an introductory yet extensive tutorial for the practitioners who are interested in building ensemble-based classification systems.

The rest of this paper is organized as follows: In Sect. 2 we present the ingredients of ensemble-based systems. In Sect. 3 we present the most popular methods for combining the base classifiers outputs. We discuss diversity generation approaches in Sect. 4. Section 5 presents selection methods for making the ensemble compact. Hybridization of several ensemble strategies is discussed in Sect. 6. Section 7 suggests criteria for selecting an ensemble method from the practitioner point of view. Finally, Sect. 8 concludes the paper.

2 The ensemble framework

A typical ensemble method for classification tasks contains the following building blocks:

. Training set. A labeled dataset used for ensemble training. The training set can be described in a variety of languages. Most frequently, the instances are described as attribute - value vectors. We use the notation A to denote the set of input attributes containing n attributes: A = {a1,..., ai,..., an} and y to represent the class variable or the target attribute.
. Base Inducer. The inducer is an induction algorithm that obtains a training set and forms a classifier that represents the generalized relationship between the input attributes and the target attribute. Let I represent an inducer. We use the notation M = I (S) for representing a classifier M which was induced by inducer I on a training set S.
. Diversity Generator. This component is responsible for generating the diverse classifiers.
. Combiner . The combiner is responsible for combining the classifications of the various classifiers.

It is useful to distinguish between dependent frameworks and independent frameworks for building ensembles. In a dependent framework the output of a classifier is used in the construction of the next classifier. Thus it is possible to take advantage of knowledge generated in previous iterations to guide the learning in the next iterations. Alternatively each classifier is built independently and their outputs are combined in some fashion.

References

$ Zupan, Marko Bohanec, Janez Demsar, Ivan Bratko, Feature Transformation by Function Decomposition, IEEE Intelligent Systems, v.13 n.2, p.38-43, March 1998 doi:10.1109/5254.671090

}};

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2010 EnsemblebasedClassifiers	Lior Rokach			Ensemble-based Classifiers				10.1007/s10462-009-9124-7		2010