# C4.5 Algorithm

## References

### 2011

• (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/C4.5_algorithm
• C4.5 is an algorithm, developed by Ross Quinlan, that generates a decision tree. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set $\displaystyle{ S = \{s_1, s_2, \ldots\} }$ of already classified samples. Each sample $\displaystyle{ s_i = (x_1, x_2, \ldots) }$ is a vector where $\displaystyle{ x_1, x_2, \ldots }$ represent attributes or features of the sample. The training data is augmented with a vector $\displaystyle{ C = (c_1, c_2, \ldots) }$ where $\displaystyle{ c_1, c_2, \ldots }$ represent the class to which each sample belongs. At each node of the tree, C4.5 chooses the one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (the reduction in entropy produced by the split, normalized by the entropy of the split itself; also known as the gain ratio). The attribute with the highest normalized information gain is chosen to make the decision. C4.5 then recurses on the resulting smaller subsets.
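For reference, the quantities behind the normalized information gain are commonly written as follows. The symbols $\displaystyle{ p_c }$ (fraction of samples in $\displaystyle{ S }$ belonging to class $\displaystyle{ c }$), $\displaystyle{ A }$ (a candidate attribute) and $\displaystyle{ S_v }$ (the subset of $\displaystyle{ S }$ where $\displaystyle{ A }$ takes value $\displaystyle{ v }$) are notation assumed here for illustration; they do not appear in the quoted text.

$\displaystyle{ H(S) = -\sum_{c} p_c \log_2 p_c }$

$\displaystyle{ \mathrm{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v) }$

$\displaystyle{ \mathrm{SplitInfo}(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} }$

$\displaystyle{ \mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)} }$

As a small worked case, a set of four samples with two in each of two classes has $\displaystyle{ H(S) = 1 }$ bit. A two-valued attribute that separates the classes perfectly gives $\displaystyle{ \mathrm{Gain} = 1 }$ and $\displaystyle{ \mathrm{SplitInfo} = 1 }$, so its gain ratio is 1.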
• In pseudocode, the general algorithm for building decision trees is:
1. Check for base cases.
2. For each attribute a:
   1. Find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the subsets obtained by splitting on a_best, and add the resulting nodes as children of the decision node.
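The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not Quinlan's implementation: it assumes categorical attributes only, and it omits pruning, continuous attributes, and missing-value handling. The function names (`entropy`, `partition`, `gain_ratio`, `build_tree`), the tuple-based tree representation, and the toy data are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def partition(rows, labels, attr):
    """Group rows and labels by the value each row takes at index attr."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[attr], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups

def gain_ratio(rows, labels, attr):
    """Information gain from splitting on attr, divided by the split information."""
    groups = partition(rows, labels, attr)
    total = len(labels)
    remainder = sum(len(sub) / total * entropy(sub) for _, sub in groups.values())
    gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or an (attr, {value: subtree}) decision node."""
    majority = Counter(labels).most_common(1)[0][0]
    # Step 1: base cases (assumed here: pure node, or no attributes left).
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Steps 2-3: score every attribute and keep the best one.
    a_best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, a_best) <= 0:
        return majority  # assumed extra base case: no attribute is informative
    # Steps 4-5: create the decision node and recurse on each subset.
    remaining = [a for a in attrs if a != a_best]
    children = {
        value: build_tree(sub_rows, sub_labels, remaining)
        for value, (sub_rows, sub_labels) in partition(rows, labels, a_best).items()
    }
    return (a_best, children)

# Toy data: attribute 0 is outlook, attribute 1 is humidity.
rows = [("sunny", "high"), ("sunny", "normal"), ("rain", "high"), ("rain", "normal")]
labels = ["no", "yes", "no", "yes"]
tree = build_tree(rows, labels, [0, 1])
print(tree)  # (1, {'high': 'no', 'normal': 'yes'})
```

In the toy data, humidity (index 1) separates the classes perfectly while outlook carries no information, so the sketch splits on humidity and both children are leaves.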

### 2002

• (Gabor Melli, 2002) ⇒ Gabor Melli. (2002). “PredictionWorks' Data Mining Glossary.” PredictionWorks.
• C4.5: A decision tree algorithm developed by Ross Quinlan, and a direct descendant of the ID3 algorithm. C4.5 can process both discrete and continuous data and produces classifications. It uses an information gain measure (normalized as the gain ratio) as its splitting criterion and employs post-pruning. Through the 1990s it was the algorithm most commonly used as a baseline for comparing results. See ID3, Pruning, Gini.
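To illustrate how a continuous attribute can be handled, one common approach is to turn it into a binary test of the form "value ≤ threshold" and pick the threshold with the highest information gain. The sketch below assumes candidate thresholds at the midpoints between consecutive distinct sorted values, which is a simplification and may differ from what C4.5 itself does; `best_threshold` and the sample data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split `value <= threshold`
    with the highest information gain, or (None, 0.0) if no split helps."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between equal values
        threshold = (low + high) / 2  # assumed: midpoint candidates
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Toy data: a numeric attribute and the class of each sample.
values = [64, 65, 70, 80]
labels = ["yes", "yes", "no", "no"]
print(best_threshold(values, labels))  # (67.5, 1.0)
```

Here the split at 67.5 separates the two classes completely, so it recovers the full 1 bit of entropy in the labels.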