# Decision Tree Training Algorithm

A Decision Tree Training Algorithm is a supervised eager model-based learning algorithm that accepts a tree meta-model to produce a predictor tree.

**AKA:**Predictor Tree Induction Method.**Context:**- input: a Decision Tree Metamodel.
- It can be implemented by a Decision Tree Training System (that solves a tree training task).
- It can be a Discriminative Learning Algorithm.
- It can be a component of a Decision Tree Ensemble Learning Algorithm (implemented by a decision tree ensemble learning system).
- It can range from being a Classification Tree Learning Algorithm, to being a Ranking Tree Learning Algorithm, to being a Regression Tree Learning Algorithm.
- It can (typically) be a Greedy Top-Down Algorithm.
- It can be an Incremental Decision Tree Algorithm.
- It can have [math]\displaystyle{ O(nm^2) }[/math] Computational Complexity, for [math]\displaystyle{ n }[/math] rows, and [math]\displaystyle{ m }[/math] attributes.
- It can range from being a Fully-Supervised Decision Tree Learning Algorithm to being a Semi-Supervised Decision Tree Learning Algorithm.
- It can be a member of a Decision Tree Ensemble Learning Algorithm.
- …

**Example(s):**- a Classification Tree Learning Algorithm, such as: C4.5, CART, RPART, ID3, CHAID.
- a Ranking Tree Learning Algorithm.
- a Regression Tree Learning Algorithm, such as: RPART, MARS, Conditional Inference Trees Algorithm.
- …

**Counter-Example(s):****See:**Instance-based Learning Algorithm, Tree Learning Algorithm.

## References

### 2017a

- (Scikit-Learn, 2017) ⇒ Scikit-Learn (2007-2017) "1.10.6. Tree algorithms: ID3, C4.5, C5.0 and CART" Retrieved:2017-10-15.
- QUOTE: What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented in scikit-learn?
- ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalise to unseen data.
- C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. These accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy of the rule improves without it.
- C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.
- CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.
scikit-learn uses an optimised version of the CART algorithm.

- QUOTE: What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented in scikit-learn?

### 2017b

- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Decision_tree_learning#Decision_tree_types Retrieved:2017-10-15.
- (...)
**Decision tree learning**is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.There are many specific decision-tree algorithms. Notable ones include:

- ID3 (Iterative Dichotomiser 3) * C4.5 (successor of ID3)
- CART (Classification And Regression Tree)
- CHAID (CHi-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.
- MARS: extends decision trees to handle numerical data better.
- Conditional Inference Trees. Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.
^{[1]}^{[2]}

- ID3 and CART were invented independently at around the same time (between 1970 and 1980), yet follow a similar approach for learning decision tree from training tuples.

- (...)

- ↑ Hothorn, T.; Hornik, K.; Zeileis, A. (2006). “Unbiased Recursive Partitioning: A Conditional Inference Framework". Journal of Computational and Graphical Statistics. 15 (3): 651–674. JSTOR 27594202. doi:10.1198/106186006X133933.
- ↑ Strobl, C.; Malley, J.; Tutz, G. (2009). “An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests". Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.

### 2017b

- (Furnkranz., 2017) ⇒ Johannes Fürnkranz (2017) "Decision Tree" in "Encyclopedia of Machine Learning and Data Mining" (2017) pp 330-335

### 2017c

- (Furnkranz., 2017) ⇒ Johannes Fürnkranz (2017) "Decision Tree", in "Encyclopedia of Machine Learning and Data Mining" (2017) pp 330-335
- QUOTE:
**Well-Known Decision-Tree Learning Algorithms**The probably best-known decision-tree learning algorithm is C4.5 (Quinlan 1993), which is based upon (Quinlan 1983), which, in turn, has been derived from an earlier concept learning system (Hunt et al. 1966). ID3 realized the basic recursive partitioning algorithm for an arbitrary number of classes and for discrete attribute values. C4.5 (Quinlan 1993) incorporates several key improvements that were necessary for tackling real-world problems, including handling of numeric and missing attribute values, overfitting avoidance, and improved scalability. A C-implementation of C4.5 is freely available from its author. A re-implementation is available under the name J4.8 in the Weka data mining library. C5.0 is a commercial successor of C4.5, distributed by RuleQuest Research. CART (Breiman et al. 1984) is the best-known system in the statistical learning community. It is integrated into various statistical software packages, such as R or S.

Decision trees are also often used as components in Ensemble Methods such as random forests (Breiman 2001) or AdaBoost (Freund and Schapire 1996). They can also be modified for predicting numerical target variables, in which case they are known as regression trees. One can also put more complex prediction models into the leaves of a tree, resulting in Model Trees.

- QUOTE:

### 2017D

- (Zhao et al., 2017) ⇒ Qian Zhao, Yue Shi, and Liangjie Hong. (2017). “GB-CENT: Gradient Boosted Categorical Embedding and Numerical Trees.” In: Proceedings of the 26th International Conference on World Wide Web. ISBN:978-1-4503-4913-0 doi:10.1145/3038912.3052668
- QUOTE: Latent factor models and decision tree based models are widely used in tasks of prediction, ranking and recommendation. Latent factor models have the advantage of interpreting categorical features by a low-dimensional representation, while such an interpretation does not naturally fit numerical features. In contrast, decision tree based models enjoy the advantage of capturing the nonlinear interactions of numerical features, while their capability of handling categorical features is limited by the cardinality of those features. …

### 2011

- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Decision_tree_learning
**Decision tree learning**, used in statistics, data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are**classification trees**or regression trees. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather the resulting classification tree can be an input for decision making. This page deals with decision trees in data mining.

### 2005

- (Rokach & Maimon, 2005) ⇒ Lior Rokach, and Oded Maimon. (2005). “Chapter 9. Decision Trees.” In: Data Mining and Knowledge Discovery Handbook, Editors: Oded Z. Maimon, Lior Rokach ISBN:038725465X
- ABSTRACT: Decision Trees are considered to be one of the most popular approaches for representing classifiers. Researchers from various disciplines such as statistics, machine learning, pattern recognition, and Data Mining have dealt with the issue of growing a decision tree from available data. This paper presents an updated survey of current methods for constructing decision tree classifiers in a top-down manner. The chapter suggests a unified algorithmic framework for presenting these algorithms and describes various splitting criteria and pruning methodologies.

### 2001

- (Breiman, 2001) ⇒ Leo Breiman. (2001). “Random Forests.” In: Machine Learning, 45(1). doi:10.1023/A:1010933404324
- ABSTRACT: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

### 2008

- (Wilson, 2008a) ⇒ Bill Wilson. (2008). “The Machine Learning Dictionary for COMP9414." University of New South Wales, Australia.
- tree induction algorithm: This article describes the basic tree induction algorithm used by ID3 and successors. The basic idea is to pick an attribute A with values [math]\displaystyle{ a_1, a_2, ..., a_r }[/math], split the training instances into subsets [math]\displaystyle{ S_{a1}, S_{a2}, ..., S_{ar} }[/math] consisting of those instances that have the corresponding attribute value. Then if a subset has only instances in a single class, that part of the tree stops with a leaf node labelled with the single class. If not, then the subset is split again, recursively, using a different attribute. This leaves the question of how to choose the best attribute to split on at any branch node. This issue is handled in the article on splitting criterion in ID3.

### 1996

- (Quinlan, 1996) ⇒ J. Ross Quinlan. (1996). “Improved Use of Continuous Attributes in C4.5.” In: Journal of Artificial Intelligence Research, 4.

### 1993

- (Quinlan, 1993a) ⇒ J. Ross Quinlan. (1993). “C4.5: Programs for machine learning." Morgan Kaufmann. ISBN:1558602380

### 1984

- (Breiman et al., 1984) ⇒ Leo Breiman, Jerome H. Friedman, Charles J. Stone, and R. A. Olshen. (1984). “Classification and Regression Trees." Chapman & Hall/CRC. ISBN:0412048418