2013 A Risk Comparison of Ordinary Least Squares Vs Ridge Regression

From GM-RKB

Subject Headings: Ordinary Least Squares Estimate.

Notes

Cited By

Quotes

Abstract

We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR).

1. Introduction

Consider the fixed design setting where we have a set of [math]\displaystyle{ n }[/math] vectors [math]\displaystyle{ \{X_i\} }[/math], and let [math]\displaystyle{ X }[/math] denote the matrix whose [math]\displaystyle{ i }[/math]-th row is [math]\displaystyle{ X_i }[/math]. The observed label vector is [math]\displaystyle{ Y \in R^n }[/math].

Suppose that [math]\displaystyle{ Y = X\beta + \epsilon }[/math], where [math]\displaystyle{ \epsilon }[/math] is independent noise in each coordinate, with the variance of [math]\displaystyle{ \epsilon_i }[/math] being [math]\displaystyle{ \sigma^2 }[/math]. The objective is to learn [math]\displaystyle{ E[Y] = X\beta }[/math]. The expected loss of a vector [math]\displaystyle{ \beta }[/math] is: [math]\displaystyle{ L(\beta) = \frac{1}{n} E_Y\big[\|Y - X\beta\|^2\big] }[/math]. Let [math]\displaystyle{ \hat{\beta} }[/math] be an estimator of [math]\displaystyle{ \beta }[/math] (constructed with a sample [math]\displaystyle{ Y }[/math]). Denoting [math]\displaystyle{ \Sigma := \frac{1}{n} X^\top X }[/math],

we have that the risk (i.e., expected excess loss) is: [math]\displaystyle{ \text{Risk}(\hat{\beta}) := E_{\hat{\beta}}\big[L(\hat{\beta}) - L(\beta)\big] = E_{\hat{\beta}} \|\hat{\beta} - \beta\|^2_{\Sigma} }[/math], where [math]\displaystyle{ \|x\|^2_{\Sigma} = x^\top \Sigma x }[/math] and where the expectation is with respect to the randomness in [math]\displaystyle{ Y }[/math].
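To make these definitions concrete, here is a minimal Python sketch (not from the paper; the design matrix, [math]\displaystyle{ \beta }[/math], noise level, and trial count are illustrative assumptions) that estimates [math]\displaystyle{ \text{Risk}(\hat{\beta}) }[/math] by Monte Carlo for any estimator of [math]\displaystyle{ \beta }[/math]:

```python
# Illustrative sketch of the fixed-design setup: Y = X beta + eps,
# Sigma = (1/n) X^T X, and Risk(beta_hat) = E ||beta_hat - beta||_Sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 10, 1.0           # assumed sizes and noise level

X = rng.normal(size=(n, p))          # fixed design; the i-th row is X_i
beta = rng.normal(size=p)            # true parameter vector
Sigma = X.T @ X / n                  # Sigma := (1/n) X^T X

def sigma_norm_sq(v):
    """Squared Sigma-norm: ||v||_Sigma^2 = v^T Sigma v."""
    return float(v @ Sigma @ v)

def risk(estimator, n_trials=2000):
    """Monte Carlo estimate of Risk(beta_hat) = E ||beta_hat - beta||_Sigma^2."""
    total = 0.0
    for _ in range(n_trials):
        Y = X @ beta + sigma * rng.normal(size=n)   # draw a fresh label vector
        total += sigma_norm_sq(estimator(X, Y) - beta)
    return total / n_trials

# Example: risk of the ordinary least squares estimator (should be near sigma^2 * p / n).
ols = lambda A, Y: np.linalg.solve(A.T @ A, A.T @ Y)
print(risk(ols))
```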

We show that a simple variant of ordinary (un-regularized) least squares always compares favorably to ridge regression (as measured by the risk). This observation is based on the following bias-variance decomposition:

[math]\displaystyle{ \text{Risk}(\hat{\beta}) = \underbrace{E\|\hat{\beta} - \bar{\beta}\|^2_{\Sigma}}_{\text{Variance}} + \underbrace{\|\bar{\beta} - \beta\|^2_{\Sigma}}_{\text{Prediction Bias}} , \quad (1) }[/math] where [math]\displaystyle{ \bar{\beta} = E[\hat{\beta}] }[/math].
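The decomposition in (1) follows by expanding the squared [math]\displaystyle{ \Sigma }[/math]-norm around [math]\displaystyle{ \bar{\beta} }[/math]: [math]\displaystyle{ E\|\hat{\beta} - \beta\|^2_{\Sigma} = E\|\hat{\beta} - \bar{\beta}\|^2_{\Sigma} + 2\, E[\hat{\beta} - \bar{\beta}]^\top \Sigma (\bar{\beta} - \beta) + \|\bar{\beta} - \beta\|^2_{\Sigma} }[/math], where the cross term vanishes because [math]\displaystyle{ E[\hat{\beta} - \bar{\beta}] = 0 }[/math].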

1.1 The Risk of Ridge Regression (RR)

Ridge regression or Tikhonov Regularization (Tikhonov, 1963) penalizes the [math]\displaystyle{ \ell_2 }[/math] norm of a parameter vector [math]\displaystyle{ \beta }[/math] and “shrinks” it towards zero, penalizing large values more. The estimator is: [math]\displaystyle{ \hat{\beta}_{\lambda} = \arg\min_{\beta} \Big\{ \frac{1}{n}\|Y - X\beta\|^2 + \lambda \|\beta\|^2 \Big\} }[/math]. The closed form estimate is then: [math]\displaystyle{ \hat{\beta}_{\lambda} = (\Sigma + \lambda I)^{-1} \Big( \frac{1}{n} X^\top Y \Big) }[/math].

Note that [math]\displaystyle{ \hat{\beta}_0 = \hat{\beta}_{\lambda=0} = \arg\min_{\beta} \|Y - X\beta\|^2 }[/math] is the ordinary least squares estimator. Without loss of generality, rotate [math]\displaystyle{ X }[/math] such that: [math]\displaystyle{ \Sigma = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) }[/math], where the [math]\displaystyle{ \lambda_j }[/math]'s are ordered in decreasing order.

To see the nature of this shrinkage, observe that: [math]\displaystyle{ [\hat{\beta}_{\lambda}]_j = \frac{\lambda_j}{\lambda_j + \lambda} [\hat{\beta}_0]_j }[/math], where [math]\displaystyle{ \hat{\beta}_0 }[/math] is the ordinary least squares estimator.
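The shrinkage relation can be checked numerically. The following is a small illustrative sketch (the design, labels, and value of [math]\displaystyle{ \lambda }[/math] are assumptions, not the paper's data) that rotates [math]\displaystyle{ X }[/math] into the coordinate system where [math]\displaystyle{ \Sigma }[/math] is diagonal and compares [math]\displaystyle{ \hat{\beta}_{\lambda} }[/math] to [math]\displaystyle{ \hat{\beta}_0 }[/math]:

```python
# Check coordinate-wise ridge shrinkage: [beta_lam]_j = lambda_j / (lambda_j + lam) * [beta_0]_j.
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 10, 0.5                         # assumed sizes and ridge parameter

X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

# Rotate X so that Sigma = (1/n) X^T X is diagonal (eigenvalues in decreasing order).
Sigma = X.T @ X / n
eigvals, V = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
Xr = X @ V                                       # rotated design

z = Xr.T @ Y / n                                 # (1/n) X^T Y in rotated coordinates
beta_ridge = z / (eigvals + lam)                 # (Sigma + lam I)^{-1} (1/n) X^T Y
beta_ols = z / eigvals                           # lam = 0: ordinary least squares

# The two estimators differ exactly by the factors lambda_j / (lambda_j + lam).
assert np.allclose(beta_ridge, eigvals / (eigvals + lam) * beta_ols)
```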

2. Ordinary Least Squares with PCA (PCA-OLS)

Now let us construct a simple estimator based on [math]\displaystyle{ \lambda }[/math]. Note that our rotated coordinate system, in which [math]\displaystyle{ \Sigma = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) }[/math], corresponds to the PCA coordinate system.

Consider the following ordinary least squares estimator on the “top” PCA subspace — it uses the least squares estimate on coordinate [math]\displaystyle{ j }[/math] if [math]\displaystyle{ \lambda_j \geq \lambda }[/math] and 0 otherwise:

[math]\displaystyle{ [\hat{\beta}_{PCA,\lambda}]_j = \begin{cases} [\hat{\beta}_0]_j & \text{if } \lambda_j \geq \lambda \\ 0 & \text{otherwise} \end{cases} }[/math]
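As a hedged illustration (the helper name pca_ols and its interface are my own, not the paper's), the estimator can be written directly in the rotated coordinate system:

```python
# PCA-OLS sketch: keep the coordinate-wise OLS estimate where lambda_j >= lam, else zero.
import numpy as np

def pca_ols(X, Y, lam):
    """Return the PCA-OLS estimate in the rotated (PCA) coordinates,
    together with the rotation V that diagonalizes Sigma = (1/n) X^T X."""
    n, p = X.shape
    Sigma = X.T @ X / n
    eigvals, V = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]            # lambda_1 >= ... >= lambda_p
    eigvals, V = eigvals[order], V[:, order]
    z = (X @ V).T @ Y / n                        # (1/n) X^T Y in rotated coordinates
    keep = eigvals >= lam                        # the "top" PCA subspace
    beta_pca = np.where(keep, z / np.maximum(eigvals, 1e-12), 0.0)
    return beta_pca, V
```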

3. Experiments

First, we generated synthetic data with [math]\displaystyle{ p = 100 }[/math] and varying values of [math]\displaystyle{ n \in \{20, 50, 80, 110\} }[/math]. …
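The excerpt elides the exact data-generating process, so the following is only a rough sketch of such an experiment under assumed Gaussian design, Gaussian [math]\displaystyle{ \beta }[/math], and an assumed threshold/regularization value, comparing Monte Carlo risk estimates of ridge regression and PCA-OLS:

```python
# Rough experimental sketch (assumed data-generating process, not the paper's):
# p = 100, varying n, Gaussian design and beta, Monte Carlo estimates of Risk(.).
import numpy as np

rng = np.random.default_rng(2)
p, sigma, lam, n_trials = 100, 1.0, 0.1, 200

for n in (20, 50, 80, 110):
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    Sigma = X.T @ X / n
    eigvals, V = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    Xr, beta_r = X @ V, V.T @ beta               # work in the PCA coordinate system

    risk_rr = risk_pca = 0.0
    for _ in range(n_trials):
        Y = Xr @ beta_r + sigma * rng.normal(size=n)
        z = Xr.T @ Y / n
        b_rr = z / (eigvals + lam)                                             # ridge
        b_pca = np.where(eigvals >= lam, z / np.maximum(eigvals, 1e-12), 0.0)  # PCA-OLS
        risk_rr += float((b_rr - beta_r) @ (eigvals * (b_rr - beta_r)))        # ||.||_Sigma^2
        risk_pca += float((b_pca - beta_r) @ (eigvals * (b_pca - beta_r)))
    print(n, risk_rr / n_trials, risk_pca / n_trials)
```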

4. Conclusion

We showed that the risk inflation of a particular ordinary least squares estimator (on the “top” PCA subspace) is within a factor of 4 of the risk of the ridge estimator. It turns out the converse is not true: this PCA estimator may be arbitrarily better than the ridge one.

References

  • 1. D. P. Foster and E. I. George. The Risk Inflation Criterion for Multiple Regression. The Annals of Statistics, Pages 1947-1975, 1994.
  • 2. A. N. Tikhonov. Solution of Incorrectly Formulated Problems and the Regularization Method. Soviet Math Dokl 4, Pages 501-504, 1963.


Author: Lyle H. Ungar, Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade
Date: 2013
Title: A Risk Comparison of Ordinary Least Squares Vs Ridge Regression