Decision Tree (DT) Training Algorithm

A Decision Tree (DT) Training Algorithm is a supervised eager model-based learning algorithm that accepts a tree meta-model to produce a predictor tree.

AKA: Predictor Tree Induction Method, Tree Induction Algorithm, Decision Tree Learning Algorithm.
Context:
- input: a Decision Tree Metamodel.
  - optional: a Decision Tree Prunning Criterion, a Decision Tree Splitting Criterion, decision tree stopping criterion, decision tree hyperparameters.
- output: a Trained Decision Tree Model, which can be a classification tree, regression tree, or ranking tree.
- Algorithm Performance Measure: decision tree accuracy metrics such as classification error, mean squared error, and information gain.
- ...
- It can typically construct decision tree nodes through recursive partitioning of the training dataset.
- It can typically select decision tree splitting attributes using information theoretic measures such as entropy, gini impurity, and variance reduction.
- It can typically handle decision tree categorical features through multi-way splits or binary encoding.
- It can typically process decision tree numerical features through threshold-based binary splits.
- It can typically address decision tree overfitting through post-pruning techniques and pre-pruning mechanisms.
- ...
- It can often implement decision tree missing value handling through surrogate splits, instance weighting, or probabilistic assignment.
- It can often incorporate decision tree cost-sensitive learning through misclassification cost matrix and weighted instance.
- It can often support decision tree interpretability through rule extraction and feature importance calculation.
- It can often facilitate decision tree visualization through hierarchical node display and path highlighting.
- ...
- It can range from being a Simple Decision Tree (DT) Training Algorithm to being a Complex Decision Tree (DT) Training Algorithm, depending on its decision tree splitting strategy and decision tree optimization approach.
- It can range from being a Univariate Decision Tree (DT) Training Algorithm to being a Multivariate Decision Tree (DT) Training Algorithm, depending on its decision tree node decision boundary complexity.
- It can range from being a Batch Decision Tree (DT) Training Algorithm to being an Online Decision Tree (DT) Training Algorithm, depending on its decision tree training data processing methodology.
- It can range from being a Fully-Supervised Decision Tree (DT) Training Algorithm to being a Semi-Supervised Decision Tree (DT) Training Algorithm, depending on its decision tree labeled data requirement.
- ...
- It can be implemented by a Decision Tree Training System (that solves a tree training task).
- It can be a Discriminative Learning Algorithm.
- It can be a component of a Decision Tree Ensemble Learning Algorithm (implemented by a decision tree ensemble learning system).
- It can range from being a Classification Tree Learning Algorithm, to being a Ranking Tree Learning Algorithm, to being a Regression Tree Learning Algorithm.
- It can (typically) be a Greedy Top-Down Algorithm.
- It can be an Incremental Decision Tree Algorithm.
- It can have [math]\displaystyle{ O(nm^2) }[/math] Computational Complexity, for [math]\displaystyle{ n }[/math] rows, and [math]\displaystyle{ m }[/math] attributes.
- It can be a member of a Decision Tree Ensemble Learning Algorithm.
- It can integrate with feature selection algorithms for decision tree feature space optimization.
- It can connect to model evaluation systems for decision tree performance assessment.
- It can support automated machine learning pipelines for decision tree hyperparameter tuning.
- ...
Examples:
Counter-Examples:
- Logistic Regression Algorithm, which creates a linear decision boundary rather than a hierarchical partitioning of the feature space.
- Decision Graph Learning Algorithm, which allows for node merging and does not maintain a strict tree structure.
- Decision Tree Testing Algorithm, which focuses on inference using trained trees rather than decision tree construction.
- Supervised Rule List Learning Algorithm, which produces sequential rule lists rather than hierarchical tree structures.
- Inductive Logic Programming Algorithm, which learns logical predicates from examples rather than attribute-based splits.
- Latent Factor Algorithm, which learns low-dimensional representations rather than explicit decision boundaries.
See: Instance-based Learning Algorithm, Tree Learning Algorithm, Random Forest Algorithm, Gradient Boosting Algorithm, XGBoost Algorithm, LightGBM Algorithm.

References

2017a

(Scikit-Learn, 2017) ⇒ Scikit-Learn (2007-2017) "1.10.6. Tree algorithms: ID3, C4.5, C5.0 and CART" Retrieved:2017-10-15.
- QUOTE: What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented in scikit-learn?
  - ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalise to unseen data.
  - C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. These accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy of the rule improves without it.
  - C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.
  - CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.
    scikit-learn uses an optimised version of the CART algorithm.

2017b

(Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Decision_tree_learning#Decision_tree_types Retrieved:2017-10-15.
- (...) Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.
  There are many specific decision-tree algorithms. Notable ones include:
  - ID3 (Iterative Dichotomiser 3) * C4.5 (successor of ID3)
  - CART (Classification And Regression Tree)
  - CHAID (CHi-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.
  - MARS: extends decision trees to handle numerical data better.
  - Conditional Inference Trees. Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.^[1] ^[2]
- ID3 and CART were invented independently at around the same time (between 1970 and 1980), yet follow a similar approach for learning decision tree from training tuples.

↑ Hothorn, T.; Hornik, K.; Zeileis, A. (2006). “Unbiased Recursive Partitioning: A Conditional Inference Framework". Journal of Computational and Graphical Statistics. 15 (3): 651–674. JSTOR 27594202. doi:10.1198/106186006X133933.
↑ Strobl, C.; Malley, J.; Tutz, G. (2009). “An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests". Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.

2017b

(Furnkranz., 2017) ⇒ Johannes Fürnkranz (2017) "Decision Tree" in "Encyclopedia of Machine Learning and Data Mining" (2017) pp 330-335

2017c

(Furnkranz., 2017) ⇒ Johannes Fürnkranz (2017) "Decision Tree", in "Encyclopedia of Machine Learning and Data Mining" (2017) pp 330-335
- QUOTE: Well-Known Decision-Tree Learning Algorithms
  The probably best-known decision-tree learning algorithm is C4.5 (Quinlan 1993), which is based upon (Quinlan 1983), which, in turn, has been derived from an earlier concept learning system (Hunt et al. 1966). ID3 realized the basic recursive partitioning algorithm for an arbitrary number of classes and for discrete attribute values. C4.5 (Quinlan 1993) incorporates several key improvements that were necessary for tackling real-world problems, including handling of numeric and missing attribute values, overfitting avoidance, and improved scalability. A C-implementation of C4.5 is freely available from its author. A re-implementation is available under the name J4.8 in the Weka data mining library. C5.0 is a commercial successor of C4.5, distributed by RuleQuest Research. CART (Breiman et al. 1984) is the best-known system in the statistical learning community. It is integrated into various statistical software packages, such as R or S.
  Decision trees are also often used as components in Ensemble Methods such as random forests (Breiman 2001) or AdaBoost (Freund and Schapire 1996). They can also be modified for predicting numerical target variables, in which case they are known as regression trees. One can also put more complex prediction models into the leaves of a tree, resulting in Model Trees.

2017D

(Zhao et al., 2017) ⇒ Qian Zhao, Yue Shi, and Liangjie Hong. (2017). “GB-CENT: Gradient Boosted Categorical Embedding and Numerical Trees.” In: Proceedings of the 26th International Conference on World Wide Web. ISBN:978-1-4503-4913-0 doi:10.1145/3038912.3052668
- QUOTE: Latent factor models and decision tree based models are widely used in tasks of prediction, ranking and recommendation. Latent factor models have the advantage of interpreting categorical features by a low-dimensional representation, while such an interpretation does not naturally fit numerical features. In contrast, decision tree based models enjoy the advantage of capturing the nonlinear interactions of numerical features, while their capability of handling categorical features is limited by the cardinality of those features. …

2011

(Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Decision_tree_learning
- Decision tree learning, used in statistics, data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather the resulting classification tree can be an input for decision making. This page deals with decision trees in data mining.

2005

(Rokach & Maimon, 2005) ⇒ Lior Rokach, and Oded Maimon. (2005). “Chapter 9. Decision Trees.” In: Data Mining and Knowledge Discovery Handbook, Editors: Oded Z. Maimon, Lior Rokach ISBN:038725465X
- ABSTRACT: Decision Trees are considered to be one of the most popular approaches for representing classifiers. Researchers from various disciplines such as statistics, machine learning, pattern recognition, and Data Mining have dealt with the issue of growing a decision tree from available data. This paper presents an updated survey of current methods for constructing decision tree classifiers in a top-down manner. The chapter suggests a unified algorithmic framework for presenting these algorithms and describes various splitting criteria and pruning methodologies.

2001

(Breiman, 2001) ⇒ Leo Breiman. (2001). “Random Forests.” In: Machine Learning, 45(1). doi:10.1023/A:1010933404324
- ABSTRACT: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

2008

(Wilson, 2008a) ⇒ Bill Wilson. (2008). “The Machine Learning Dictionary for COMP9414." University of New South Wales, Australia.
- tree induction algorithm: This article describes the basic tree induction algorithm used by ID3 and successors. The basic idea is to pick an attribute A with values [math]\displaystyle{ a_1, a_2, ..., a_r }[/math], split the training instances into subsets [math]\displaystyle{ S_{a1}, S_{a2}, ..., S_{ar} }[/math] consisting of those instances that have the corresponding attribute value. Then if a subset has only instances in a single class, that part of the tree stops with a leaf node labelled with the single class. If not, then the subset is split again, recursively, using a different attribute. This leaves the question of how to choose the best attribute to split on at any branch node. This issue is handled in the article on splitting criterion in ID3.

1996

(Quinlan, 1996) ⇒ J. Ross Quinlan. (1996). “Improved Use of Continuous Attributes in C4.5.” In: Journal of Artificial Intelligence Research, 4.

1993

(Quinlan, 1993a) ⇒ J. Ross Quinlan. (1993). “C4.5: Programs for machine learning." Morgan Kaufmann. ISBN:1558602380

1984

(Breiman et al., 1984) ⇒ Leo Breiman, Jerome H. Friedman, Charles J. Stone, and R. A. Olshen. (1984). “Classification and Regression Trees." Chapman & Hall/CRC. ISBN:0412048418

[Hothorn2006-1] Hothorn, T.; Hornik, K.; Zeileis, A. (2006). “Unbiased Recursive Partitioning: A Conditional Inference Framework". Journal of Computational and Graphical Statistics. 15 (3): 651–674. JSTOR 27594202. doi:10.1198/106186006X133933.

[Strobl2009-2] Strobl, C.; Malley, J.; Tutz, G. (2009). “An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests". Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.

[1]

[2]

Decision Tree (DT) Training Algorithm

References

2017a

2017b

2017b

2017c

2017D

2011

2005

2001

2008

1996

1993

1984

Navigation menu

Search