Decision Tree (DT) Training Algorithm


A Decision Tree (DT) Training Algorithm is a supervised eager model-based learning algorithm that accepts a tree meta-model to produce a predictor tree.



  • (Scikit-Learn, 2017) ⇒ Scikit-Learn (2007-2017) "1.10.6. Tree algorithms: ID3, C4.5, C5.0 and CART" Retrieved:2017-10-15.
    • QUOTE: What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented in scikit-learn?
      • ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalise to unseen data.
      • C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy of the rule improves without it.
      • C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.
      • CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.

        scikit-learn uses an optimised version of the CART algorithm.
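The CART-style split search described in the quote above (choose the feature and threshold with the largest information gain at a node) can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function names `entropy`, `information_gain`, and `best_binary_split` are hypothetical, the toy data is made up, and using midpoints between consecutive distinct values as candidate thresholds is an assumption of this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


def best_binary_split(rows, labels):
    """Greedy search over every (feature, threshold) pair at a single node.

    `rows` is a list of equal-length numeric feature lists. Candidate
    thresholds (an assumption of this sketch) are the midpoints between
    consecutive distinct values of each feature.
    Returns (feature_index, threshold, gain), or None if no split exists.
    """
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            gain = information_gain(labels, left, right)
            if best is None or gain > best[2]:
                best = (f, threshold, gain)
    return best


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
labels = ["a", "a", "b", "b"]
feature, threshold, gain = best_binary_split(rows, labels)
print(feature, threshold, gain)
```

A full tree learner would apply this search recursively to the two resulting subsets until a stopping rule is met; only the single-node step is shown here.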








  • (Rokach & Maimon, 2005) ⇒ Lior Rokach, and Oded Maimon. (2005). “Chapter 9. Decision Trees.” In: Data Mining and Knowledge Discovery Handbook, Editors: Oded Z. Maimon, Lior Rokach ISBN:038725465X
    • ABSTRACT: Decision Trees are considered to be one of the most popular approaches for representing classifiers. Researchers from various disciplines such as statistics, machine learning, pattern recognition, and Data Mining have dealt with the issue of growing a decision tree from available data. This paper presents an updated survey of current methods for constructing decision tree classifiers in a top-down manner. The chapter suggests a unified algorithmic framework for presenting these algorithms and describes various splitting criteria and pruning methodologies.


  • (Breiman, 2001) ⇒ Leo Breiman. (2001). “Random Forests.” In: Machine Learning, 45(1). doi:10.1023/A:1010933404324
    • ABSTRACT: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
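The two ingredients named in the abstract above, independently sampled randomness per tree and a random selection of features at each split, can be illustrated with a deliberately reduced sketch. It is not Breiman's procedure: each "tree" here is a one-node decision stump (so the per-node feature selection happens once per tree), the stump is chosen by training error, and the names `fit_stump`, `fit_forest`, and `predict` as well as the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) among `features` with the fewest training errors.

    Returns (feature, threshold, left_label, right_label).
    """
    best = None
    for f in features:
        for threshold in sorted(set(row[f] for row in rows)):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            # `right` is empty when the threshold is the largest value seen.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train `n_trees` stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random feature selection for the (single) split of this stump.
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], features))
    return forest


def predict(forest, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): both features are low for "neg" and high for "pos".
rows = [[0, 1], [1, 0], [1, 1], [0, 0], [9, 10], [10, 9], [10, 10], [9, 9]]
labels = ["neg"] * 4 + ["pos"] * 4
forest = fit_forest(rows, labels)
print(predict(forest, [0, 0]), predict(forest, [10, 10]))
```

Replacing the stump with a fully grown tree that redraws the feature subset at every node would bring the sketch closer to the method the abstract describes.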


  • (Wilson, 2008a) ⇒ Bill Wilson. (2008). “The Machine Learning Dictionary for COMP9414.” University of New South Wales, Australia.
    • tree induction algorithm: This article describes the basic tree induction algorithm used by ID3 and successors. The basic idea is to pick an attribute A with values [math]a_1, a_2, \ldots, a_r[/math], split the training instances into subsets [math]S_{a_1}, S_{a_2}, \ldots, S_{a_r}[/math] consisting of those instances that have the corresponding attribute value. Then if a subset has only instances in a single class, that part of the tree stops with a leaf node labelled with the single class. If not, then the subset is split again, recursively, using a different attribute. This leaves the question of how to choose the best attribute to split on at any branch node. This issue is handled in the article on splitting criterion in ID3.
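The recursive procedure in the entry above can be sketched as follows. The quoted text leaves the choice of attribute to a separate article, so this sketch assumes information gain as the selection rule; the names `partition`, `gain`, `induce_tree`, and `classify`, the dictionary-based tree representation, the majority-class fallback when attributes run out, and the toy data are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def partition(instances, attribute):
    """Group (features, label) pairs by the value each has for `attribute`."""
    subsets = {}
    for features, label in instances:
        subsets.setdefault(features[attribute], []).append((features, label))
    return subsets


def gain(instances, attribute):
    """Information gain of splitting `instances` on `attribute` (assumed criterion)."""
    labels = [label for _, label in instances]
    remainder = sum(
        (len(subset) / len(instances)) * entropy([label for _, label in subset])
        for subset in partition(instances, attribute).values()
    )
    return entropy(labels) - remainder


def induce_tree(instances, attributes):
    """Return a class label (leaf) or {"attribute": A, "branches": {value: subtree}}."""
    labels = [label for _, label in instances]
    if len(set(labels)) == 1:
        return labels[0]  # single class: stop with a leaf
    if not attributes:
        # No attribute left to split on: fall back to the majority class.
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(instances, a))
    remaining = [a for a in attributes if a != best]  # use a different attribute below
    return {
        "attribute": best,
        "branches": {
            value: induce_tree(subset, remaining)
            for value, subset in partition(instances, best).items()
        },
    }


def classify(tree, features):
    """Follow branches until a leaf; raises KeyError for an unseen attribute value."""
    while isinstance(tree, dict):
        tree = tree["branches"][features[tree["attribute"]]]
    return tree


# Toy data (hypothetical): categorical attributes, string class labels.
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = induce_tree(data, ["outlook", "windy"])
print(tree)
```

Because the toy data contains no conflicting instances, the induced tree classifies every training instance correctly whichever attribute is chosen at the root.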