2019 ReconcilingModernMachineLearnin

From GM-RKB

Subject Headings: Bias-Variance Trade-Off.

Notes

Cited By

Quotes

Abstract

Breakthroughs in machine learning are rapidly changing science and society, yet our fundamental understanding of this technology has lagged far behind. Indeed, one of the central tenets of the field, the bias-variance trade-off, appears to be at odds with the observed behavior of methods used in modern machine-learning practice. The bias-variance trade-off implies that a model should balance underfitting and overfitting: Rich enough to express underlying structure in data and simple enough to avoid fitting spurious patterns. However, in modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered overfitted, and yet they often obtain high accuracy on test data. This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners. In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This "double-descent" curve subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance. We provide evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets, and we posit a mechanism for its emergence. This connection between the performance and the structure of machine-learning models delineates the limits of classical analyses and has implications for both the theory and the practice of machine learning.

Significance

While breakthroughs in machine learning and artificial intelligence are changing society, our fundamental understanding has lagged behind. It is traditionally believed that fitting models to the training data exactly is to be avoided as it leads to poor performance on unseen data. However, powerful modern classifiers frequently have near-perfect fit in training, a disconnect that spurred recent intensive research and controversy on whether theory provides practical insights. In this work, we show how classical theory and modern practice can be reconciled within a single unified performance curve and propose a mechanism underlying its emergence. We believe this previously unknown pattern connecting the structure and performance of learning architectures will help shape design and understanding of learning algorithms.

Body

Machine learning has become key to important applications in science, technology, and commerce. The focus of machine learning is on the problem of prediction: Given a sample of training examples $(x_1, y_1), \ldots, (x_n, y_n)$ from $\mathbb{R}^d \times \mathbb{R}$, we learn a predictor $h_n : \mathbb{R}^d \to \mathbb{R}$ that is used to predict the label $y$ of a new point $x$, unseen in training.

The predictor $h_n$ is commonly chosen from some function class $H$, such as neural networks with a certain architecture, using empirical risk minimization (ERM) and its variants. In ERM, the predictor is taken to be a function $h \in H$ that minimizes the empirical (or training) risk $\frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$, where $\ell$ is a loss function, such as the squared loss $\ell(y', y) = (y' - y)^2$ for regression or the 0–1 loss $\ell(y', y) = \mathbf{1}\{y' \neq y\}$ for classification.
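As a concrete illustration, the empirical risk above is just the average loss over the training sample. The following is a minimal sketch on toy data; the predictor `h`, the data, and the helper names are illustrative, not from the paper:

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    """Average loss of predictor h over the n training examples."""
    preds = np.array([h(x) for x in X])
    return np.mean(loss(preds, y))

# Squared loss for regression: l(y', y) = (y' - y)^2
squared_loss = lambda yp, y: (yp - y) ** 2
# 0-1 loss for classification: l(y', y) = 1{y' != y}
zero_one_loss = lambda yp, y: (yp != y).astype(float)

# Toy data (illustrative) and a constant predictor that always outputs 1
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
h = lambda x: 1.0

print(empirical_risk(h, X, y, squared_loss))   # -> 0.666... (2/3)
print(empirical_risk(h, X, y, zero_one_loss))  # -> 0.666... (wrong on 2 of 3)
```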

The goal of machine learning is to find $h_n$ that performs well on new data, unseen in training. To study performance on new data (known as generalization), we typically assume the training examples are sampled randomly from a probability distribution $P$ over $\mathbb{R}^d \times \mathbb{R}$ and evaluate $h_n$ on a new test example $(x, y)$ drawn independently from $P$. The challenge stems from the mismatch between the goals of minimizing the empirical risk (the explicit goal of ERM algorithms, optimization) and minimizing the true (or test) risk $\mathbb{E}_{(x,y) \sim P}[\ell(h(x), y)]$ (the goal of machine learning).
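To make the gap between empirical and true risk concrete, the sketch below uses a hypothetical 1-nearest-neighbor "memorizer" on noisy toy data (not a construction from the paper). It attains exactly zero empirical risk under the 0–1 loss, while its test risk, estimated on a fresh sample from the same distribution $P$, stays well above zero because of label noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw a sample i.i.d. from a toy distribution P (illustrative):
# the label is 1{x > 0}, flipped with probability 0.2 (label noise).
def sample(n):
    x = rng.uniform(-1, 1, n)
    y = (x > 0).astype(float)
    y = np.where(rng.random(n) < 0.2, 1 - y, y)
    return x, y

x_tr, y_tr = sample(50)
x_te, y_te = sample(5000)

# 1-nearest-neighbor predictor: memorizes the training labels.
def predict(x):
    return y_tr[np.argmin(np.abs(x_tr - x))]

train_risk = np.mean([predict(x) != y for x, y in zip(x_tr, y_tr)])
test_risk = np.mean([predict(x) != y for x, y in zip(x_te, y_te)])
print(train_risk)  # exactly 0.0: each training point is its own nearest neighbor
print(test_risk)   # clearly above 0 on fresh data
```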

Conventional wisdom in machine learning suggests controlling the capacity of the function class $H$ based on the bias–variance trade-off by balancing underfitting and overfitting (cf. refs. 1 and 2): 1) If $H$ is too small, all predictors in $H$ may underfit the training data (i.e., have large empirical risk) and hence predict poorly on new data. 2) If $H$ is too large, the empirical risk minimizer may overfit spurious patterns in the training data, resulting in poor accuracy on new examples (small empirical risk but large true risk).

The classical thinking is concerned with finding the "sweet spot" between underfitting and overfitting. The control of the function class capacity may be explicit, via the choice of $H$ (e.g., picking the neural network architecture), or it may be implicit, using regularization (e.g., early stopping). When a suitable balance is achieved, the performance of $h_n$ on the training data is said to generalize to the population $P$. This is summarized in the classical U-shaped risk curve shown in Fig. 1A, which has been widely used to guide model selection and is even thought to describe aspects of human decision making (3). The textbook corollary of this curve is that "a model with zero training error is overfit to the training data and will typically generalize poorly" (ref. 2, p. 221), a view still widely accepted.
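The classical picture can be reproduced with a simple polynomial-regression sweep (a toy sketch using NumPy's `polyfit`; the data and degree choices are illustrative, not from the paper): degree 0 underfits, the true degree sits near the sweet spot, and a high degree nearly interpolates the noisy training points.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 15
x = np.linspace(-1, 1, n)
y = x**2 + 0.1 * rng.standard_normal(n)   # quadratic ground truth + noise
x_test = np.linspace(-1, 1, 200)
y_test = x_test**2                        # noiseless targets for the test risk

for degree in [0, 2, 12]:
    coeffs = np.polyfit(x, y, degree)     # least-squares polynomial fit
    train_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_risk = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train={train_risk:.4f}  test={test_risk:.4f}")
```

Training risk falls monotonically as capacity grows (the degree-0 fit underfits badly), while test risk is smallest near the true degree: the sweet spot of the U-shaped curve.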

However, practitioners routinely use modern machine-learning methods, such as large neural networks and other nonlinear predictors, that have very low or zero training risk. Despite the high function class capacity and near-perfect fit to training data, these predictors often give very accurate predictions on new data. Indeed, this behavior has guided a best practice in deep learning for choosing neural network architectures, specifically that the network should be large enough to permit effortless zero-loss training (called interpolation) of the training data (4). Moreover, in direct challenge to the bias–variance trade-off philosophy, recent empirical evidence indicates that neural …

Figure 1: Curves for training risk (dashed line) and test risk (solid line). (A) The classical U-shaped risk curve arising from the bias–variance trade-off. (B) The double-descent risk curve, which incorporates the U-shaped risk curve (i.e., the "classical" regime) together with the observed behavior from using high-capacity function classes (i.e., the "modern" interpolating regime), separated by the interpolation threshold. The predictors to the right of the interpolation threshold have zero training risk.
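A double-descent-shaped curve can be observed in a toy setup with minimum-norm least squares on random ReLU features (an illustrative sketch, not the paper's exact experiment; all names and parameters here are assumptions). As the number of features $p$ grows, test risk first traces the classical U shape, typically spikes near the interpolation threshold $p \approx n$, and then descends again, while training risk is pinned at zero past the threshold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem with noisy labels (illustrative)
n_train, n_test = 20, 200
x_train = rng.uniform(-1, 1, n_train)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = np.sin(np.pi * x_test)

def relu_features(x, W, b):
    # Random ReLU features: phi_j(x) = max(0, w_j * x + b_j)
    return np.maximum(0.0, np.outer(x, W) + b)

def risks(p):
    """Train/test squared risk of the least-squares fit with p random features."""
    W = rng.standard_normal(p)
    b = rng.standard_normal(p)
    Phi_tr = relu_features(x_train, W, b)
    Phi_te = relu_features(x_test, W, b)
    # lstsq returns the minimum-norm solution when the system is underdetermined,
    # i.e., the interpolating fit with smallest coefficient norm for p > n
    coef, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
    return (np.mean((Phi_tr @ coef - y_train) ** 2),
            np.mean((Phi_te @ coef - y_test) ** 2))

for p in [2, 5, 10, 20, 40, 100, 400]:
    tr, te = risks(p)
    print(f"p={p:4d}  train={tr:.4f}  test={te:.4f}")
```

Past the interpolation threshold ($p > n$), training risk is numerically zero, yet among interpolating fits the minimum-norm one becomes smoother as $p$ grows, which is the mechanism behind the second descent.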

References


Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal (2019). "Reconciling Modern Machine-learning Practice and the Classical Bias–Variance Trade-off."