- (Caruana & Niculescu-Mizil, 2006) ⇒ Rich Caruana, and Alexandru Niculescu-Mizil. (2006). “An Empirical Comparison of Supervised Learning Algorithms.” In: Proceedings of the 23rd International Conference on Machine learning. ISBN:1-59593-383-2 doi:10.1145/1143844.1143865
- It evaluate the performance of SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps
- It evaluates performance on eleven binary classification problems
- It uses a variety of performance metrics: accuracy, F-score, Lift, ROC Area, average precision, precision recall break-even point, squared error, and cross-entropy.
- For each algorithm it examines common variations, and the space of parameters (e.g. decision tree styles, neural nets of many sizes, SVMs with many kernels, ...)
A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90's. We present a large-scale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We also examine the effect that calibrating the models via Platt Scaling and Isotonic Regression has on their performance. An important aspect of our study is the use of a variety of performance criteria to evaluate the learning methods.
There are few comprehensive empirical studies comparing learning algorithms. STATLOG is perhaps the best known study (King et al., 1995). STATLOG was very comprehensive when it was performed, but since then new learning algorithms have emerged (e.g., bagging, boosting, SVMs, random forests) that have excellent performance. An extensive empirical evaluation of modern learning methods would be useful.
Learning algorithms are now used in many domains, and different performance metrics are appropriate for each domain. For example Precision/Recall measures are used in information retrieval; medicine prefers ROC area; Lift is appropriate for some marketing tasks, etc. The different performance metrics measure different tradeoffs in the predictions made by a classifier, and it is possible for learning methods to perform well on one metric, but be suboptimal on other metrics. Because of this it is important to evaluate algorithms on a broad set of performance metrics.
This paper presents results of a large-scale empirical comparison of ten supervised learning algorithms using eight performance criteria. We evaluate the performance of SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps on eleven binary classification problems using a variety of performance metrics: accuracy, F-score, Lift, ROC Area, average precision, precision recall break-even point, squared error, and cross-entropy. For each algorithm we examine common variations, and thoroughly explore the space of parameters. For example, we compare ten decision tree styles, neural nets of many sizes, SVMs with many kernels, etc.
Because some of the performance metrics we examine interpret model predictions as probabilities and models such as SVMs are not designed to predict probabilities, we compare the performance of each algorithm both before and after calibrating its predictions with Platt Scaling and Isotonic Regression.
The empirical results are surprising. To preview: prior to calibration, bagged trees, random forests, and neural nets give the best average performance across all eight metrics and eleven test problems. Boosted trees, however, are best if we restrict attention to the six metrics that do not require probabilities. After calibration with Platt's Method, boosted trees predict better probabilities than all other methods and move into first place overall. Neural nets, on the other hand, are so well calibrated to begin with that they are hurt slightly by calibration. After calibration with Platt's Method or Isotonic Regression, SVMs perform comparably to neural nets and nearly as well as boosted trees, random forests and bagged trees. Boosting full decision trees dramatically outperforms boosting weaker stumps on most problems. On average, memory-based learning, boosted stumps, single decision trees, logistic regression, and naive bayes are not competitive with the best methods. These generalizations, however, do not always hold. For example, boosted stumps and logistic regression, which perform poorly on average, are the best models for some metrics on two of the test problems.
2.1. Learning Algorithms
We attempt to explore the space of parameters and common variations for each learning algorithm as thoroughly as is computationally feasible. This section summarizes the parameters used for each learning algorithm, and may safely be skipped by readers who are easily bored.
SVMs: we use the following kernels in SVMLight (Joachims, 1999): linear, polynomial degree 2 & 3, radial with width f0.001,0.005,0.01,0.05,0.1,0.5,1,2g. We also vary the regularization parameter by factors of ten from 107 to 103 with each kernel.
ANN: we train neural nets with gradient descent backprop and vary the number of hidden units f1,2,4,8,32,128g and the momentum f0,0.2,0.5,0.9g. We halt training the nets at many different epochs and use validation sets to select the best nets.
Logistic Regression (LOGREG): we train both unregularized and regularized models, varying the ridge (regularization) parameter by factors of 10 from 108 to 104.
Naive Bayes (NB): we use Weka (Witten & Frank, 2005) and try all three of the Weka options for han- dling continuous attributes: modeling them as a single normal, modeling them with kernel estimation, or dis- cretizing them using supervised discretization.
KNN: we use 26 values of K ranging from K = 1 to K = jtrainsetj. We use KNN with Euclidean distance and Euclidean distance weighted by gain ratio. We also use distance weighted KNN, and locally weighted averaging. The kernel widths for locally weighted aver- aging vary from 20 to 210 times the minimum distance between any two points in the train set.
Random Forests (RF): we tried both the Breiman-Cutler and Weka implementations; Breiman-Cutler yielded better results so we report those here. The forests have 1024 trees. The size of the feature set considered at each split is 1,2,4,6,8,12,16 or 20.
Decision trees (DT): we vary the splitting criterion, pruning options, and smoothing (Laplacian or Bayesian smoothing). We use all of the tree models in Buntine's IND package (Buntine & Caruana, 1991): BAYES, ID3, CART, CART0, C4, MML, and SMML. We also generate trees of type C44LS (C4 with no pruning and Laplacian smoothing), C44BS (C44 with Bayesian smoothing), and MMLLS (MML with Laplacian smoothing). See (Provost & Domingos, 2003) for a description of C44LS.
Bagged trees (BAG-DT): we bag 100 trees of each type described above. With boosted trees (BST-DT) we boost each tree type as well. Boosting can overfit, so we consider boosted trees after 2,4,8,16,32,64,128,256,512,1024 and 2048 steps of boosting. With boosted stumps (BST-STMP) we boost single level decision trees generated with 5 different splitting criteria, each boosted for 2,4,8,16,32,64,128,256,512,1024,2048,4096,8192 steps.
With LOGREG, ANN, SVM and KNN we scale attributes to 0 mean 1 std. With DT, RF, NB, BAG-DT, BST-DT and BST-STMP we don't scale the data. In total, we train about 2000 different models in each trial on each problem.
2.2. Performance Metrics
We divide the eight performance metrics into three groups: threshold metrics, ordering/rank metrics and probability metrics.
The threshold metrics are accuracy (ACC), F-score (FSC) and lift (LFT). See (Giudici, 2003) for a description of Lift Curves. Usually ACC and FSC have a fixed threshold (we use 0.5). For lift, often a fixed percent, p, of cases are predicted as positive and the rest as negative (we use p = 25%). For thresholded metrics, it is not important how close a prediction is to a threshold, only if it is above or below threshold.
The ordering/rank metrics depend only on the ordering of the cases, not the actual predicted values. As long as ordering is preserved, it makes no difference if predicted values fall between 0 and 1 or 0.89 and 0.90. These metrics measure how well the positive cases are ordered before negative cases and can be viewed as a summary of model performance across all possible thresholds. The rank metrics we use are area under the ROC curve (ROC), average precision (APR), and precision/recall break even point (BEP). See (Provost & Fawcett, 1997) for a discussion of ROC from a machine learning perspective.
The probability metrics, squared error (RMS) and cross-entropy (MXE), interpret the predicted value of each case as the conditional probability of that case being in the positive class. See (Caruana & Niculescu-Mizil, 2004) for a more detailed description of the eight performance metrics.
The field has made substantial progress in the last decade. Learning methods such as boosting, random forests, bagging, and SVMs achieve excellent performance that would have been difficult to obtain just 15 years ago. Of the earlier learning methods, feedforward neural nets have the best performance and are competitive with some of the newer methods, particularly if models will not be calibrated after training.
Calibration with either Platt's method or Isotonic Regression is remarkably effective at obtaining excellent performance on the probability metrics from learning algorithms that performed well on the ordering metrics. Calibration dramatically improves the performance of boosted trees, SVMs, boosted stumps, and Naive Bayes, and provides a small, but noticeable improvement for random forests. Neural nets, bagged trees, memory based methods, and logistic regression are not significantly improved by calibration.
With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. Random forests are close second, followed by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets. The models that performed poorest were naive bayes, logistic regression, decision trees, and boosted stumps. Although some methods clearly perform better or worse than other methods on average, there is significant variability across the problems and metrics. Even the best models sometimes perform poorly, and models with poor average performance occasionally perform exceptionally well.
|2006 AnEmpiricalComparisonofSupervis||Rich Caruana|
|An Empirical Comparison of Supervised Learning Algorithms||10.1145/1143844.1143865||2006|