# C4.5 Algorithm

A C4.5 algorithm is a classification tree training algorithm that uses an Information Gain impurity function as a decision tree splitting criterion.

**Context:**

- It was a direct descendant of the ID3 algorithm.
- It implements the Information Gain Measure as its Branch Splitting Heuristic.
- It employs Post-Pruning.
- It is implemented in the C4.5 System.
- …

**Counter-Example(s):**

**See:** Decision Tree Pruning, Gini Index, Ross Quinlan.

## References

### 2011

- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/C4.5_algorithm
- **C4.5** is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set [math]\displaystyle{ S = \{s_1, s_2, \ldots\} }[/math] of already classified samples. Each sample [math]\displaystyle{ s_i = \{x_1, x_2, \ldots\} }[/math] is a vector where [math]\displaystyle{ x_1, x_2, \ldots }[/math] represent attributes or features of the sample. The training data is augmented with a vector [math]\displaystyle{ C = \{c_1, c_2, \ldots\} }[/math] where [math]\displaystyle{ c_1, c_2, \ldots }[/math] represent the class to which each sample belongs. At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sublists.
- In pseudocode, the general algorithm for building decision trees is:
  1. Check for base cases
  2. For each attribute *a*
     - Find the normalized information gain from splitting on *a*
  3. Let *a_best* be the attribute with the highest normalized information gain
  4. Create a decision *node* that splits on *a_best*
  5. Recurse on the sublists obtained by splitting on *a_best*, and add those nodes as children of *node*
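
The pseudocode steps can be illustrated with a minimal Python sketch. It assumes categorical attributes only and reads "normalized information gain" as the gain ratio (information gain divided by the split information). The names `entropy`, `gain_ratio`, and `build_tree`, the dictionary-based tree layout, and the toy data are hypothetical choices made for this illustration; they are not taken from the C4.5 System. Continuous attributes, missing values, and pruning are left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Normalized information gain from splitting on the categorical attribute `attr`."""
    total = len(labels)
    # Group the class labels by the value each row takes for `attr`.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Information gain: parent entropy minus the weighted entropy of the subsets.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or {"attr": a_best, "children": {value: subtree}}."""
    majority = Counter(labels).most_common(1)[0][0]
    # Base cases: all samples share one class, or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Score every attribute and keep the one with the highest gain ratio.
    a_best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    # Base case: no attribute provides any gain.
    if gain_ratio(rows, labels, a_best) <= 0:
        return majority
    node = {"attr": a_best, "children": {}}
    remaining = [a for a in attrs if a != a_best]
    # Recurse on the sublists obtained by splitting on a_best.
    for value in sorted({row[a_best] for row in rows}):
        subset = [(r, c) for r, c in zip(rows, labels) if r[a_best] == value]
        sub_rows, sub_labels = zip(*subset)
        node["children"][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return node


# Toy data (hypothetical): "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

On this toy data the sketch splits on `outlook`, because it is the only attribute with a positive gain ratio, and both resulting subsets become leaves.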

### 2009

- (Wu & Kumar, 2009) ⇒ Xindong Wu, and Vipin Kumar, editors. (2009). “The Top Ten Algorithms in Data Mining.” Chapman & Hall. ISBN:1420089641

### 2002

- (Gabor Melli, 2002) ⇒ Gabor Melli. (2002). “PredictionWorks' Data Mining Glossary.” PredictionWorks.
- C4.5: A decision tree algorithm developed by Ross Quinlan, and a direct descendant of the ID3 algorithm. C4.5 can process both discrete and continuous data and makes classifications. C4.5 implements the information gain measure as its splitting criterion and employs post-pruning. Through the 1990s it was the most common algorithm to compare results against. See ID3, Pruning, Gini.
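
One common way to handle a continuous attribute is to turn it into a binary test of the form `value <= threshold`. The sketch below is a simplified illustration under that assumption: candidate thresholds are midpoints between adjacent distinct sorted values, and the candidate with the highest information gain is kept. The function name `best_threshold`, the midpoint choice, and the sample data are hypothetical and are not taken from the glossary entry or from the C4.5 System.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split `value <= threshold` with the highest gain."""
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, total):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        threshold = (lo + hi) / 2  # midpoint: a simplifying assumption of this sketch
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        gain = base - len(left) / total * entropy(left) - len(right) / total * entropy(right)
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical sample: a numeric attribute with one class label per value.
values = [64, 65, 70, 75, 80]
labels = ["yes", "yes", "yes", "no", "no"]
threshold, gain = best_threshold(values, labels)
```

Here the split between 70 and 75 separates the two classes completely, so its gain equals the entropy of the full label list.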

### 1996

- (Quinlan, 1996) ⇒ J. Ross Quinlan. (1996). “Improved Use of Continuous Attributes in C4.5.” In: Journal of Artificial Intelligence Research, 4.

### 1993

- (Quinlan, 1993a) ⇒ J. Ross Quinlan. (1993). “C4.5: Programs for Machine Learning.” Morgan Kaufmann. ISBN:1558602380