Linear Least-Squares L2-Regularized Regression Task

A Linear Least-Squares L2-Regularized Regression Task is a regularized linear least-squares regression task whose regularization term is the squared L2 norm of the model parameters.

AKA: Ridge Regression Task, Regularized Least-Squares Regression Task, Tikhonov Regularization Task.
Context:
- Task Input:
  an N-observed Numerically-Labeled Training Dataset [math]\displaystyle{ D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_n,y_n)\} }[/math] that can be represented by
  - [math]\displaystyle{ \mathbf{Y} }[/math] response variable continuous dataset.
  - [math]\displaystyle{ \mathbf{X} }[/math] predictor variables continuous dataset.
- Task Output:
  - [math]\displaystyle{ \boldsymbol{\beta}=\{\beta_0,\beta_1,...,\beta_p\} }[/math], estimated linear model parameters vector, a continuous dataset.
  - [math]\displaystyle{ \mathbf{\hat{Y}}=f(x_i,\hat{\beta_j}) }[/math], predicted values (the Fitted Linear Function), a continuous dataset.
  - [math]\displaystyle{ \lambda }[/math], the regularization parameter.
  - [math]\displaystyle{ \sigma_x,\sigma_y,\rho_{X,Y}... }[/math], standard deviations, correlation coefficient, standard error of estimate and other statistical information about the fitted parameters.
- Task Requirements:
  It requires minimizing a regularized objective function in which the regularization function, [math]\displaystyle{ R(f) }[/math], is the squared L2 norm of the parameters, that is,
  [math]\displaystyle{ \underset{\boldsymbol{\beta}}{\text{minimize}}\{E(f)+\lambda \sum_{j=0}^p \parallel \beta_j\parallel^2\} }[/math],
  where [math]\displaystyle{ E(f) }[/math] is usually the linear least-squares task objective function.
  
  It also requires a regression diagnostic test to determine the goodness of fit of the regression model and the statistical significance of the estimated parameters.

Example(s):
For the linear regression task represented by the equation [math]\displaystyle{ y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip}+\varepsilon_i }[/math], the ridge regression task can be stated as
[math]\displaystyle{ \underset{\boldsymbol{\beta}}{\text{minimize}} \{ \sum_{i=1}^n \left( y_i - \sum_{j=0}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \} }[/math] with [math]\displaystyle{ x_{i0}=1 }[/math], so that the intercept [math]\displaystyle{ \beta_0 }[/math] enters the fit but not the penalty.
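Because the penalty in the objective above runs over j = 1, ..., p, the intercept is not shrunk. One common way to handle this, shown here as a minimal sketch and not as a prescribed method, is to center the predictors and the response, fit the penalized slopes, and then recover the intercept. The function name `ridge_with_intercept` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ridge_with_intercept(X, y, lam):
    """Ridge fit with an unpenalized intercept, via centering (illustrative helper)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centered predictors
    yc = y - y_mean          # centered response
    p = X.shape[1]
    # Penalized slopes: solve (Xc^T Xc + lam I) beta = Xc^T yc.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    # Intercept recovered from the means; it carries no penalty.
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Synthetic data (assumed for the example only).
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=(60, 2))
y = 5.0 + X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=60)

beta0, beta = ridge_with_intercept(X, y, lam=2.0)
print("intercept:", beta0, "slopes:", beta)
```

The same estimate can be obtained by adding a column of ones to the design matrix and using a penalty matrix whose first diagonal entry is zero; centering simply avoids building that matrix.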

For the linear regression task represented in the matrix form [math]\displaystyle{ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta} + \mathbf{U} }[/math], the ridge regression task can be stated as
[math]\displaystyle{ \underset{\boldsymbol{\beta}}{\text{minimize}}\{ \parallel \mathbf{X}\boldsymbol{\beta} - \mathbf{Y} \parallel^2 + \lambda \parallel \mathbf{I}\boldsymbol{\beta}\parallel^2 \} }[/math] where [math]\displaystyle{ \mathbf{I} }[/math] is the identity matrix.
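The matrix-form objective can be evaluated and minimized numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the helper names `ridge_objective` and `ridge_fit` are hypothetical, and the minimizer is obtained from the normal equations that the references below derive.

```python
import numpy as np

def ridge_objective(X, Y, beta, lam):
    """Value of ||X beta - Y||^2 + lam * ||beta||^2."""
    residual = X @ beta - Y
    return float(residual @ residual + lam * (beta @ beta))

def ridge_fit(X, Y, lam):
    """Minimizer of the ridge objective (illustrative helper)."""
    p = X.shape[1]
    # Solve (X^T X + lam I) beta = X^T Y instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Synthetic data (assumed for the example only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 1.0
beta_hat = ridge_fit(X, Y, lam)   # ridge estimate
beta_ols = ridge_fit(X, Y, 0.0)   # lam = 0 gives ordinary least squares

print("ridge:", beta_hat, "objective:", ridge_objective(X, Y, beta_hat, lam))
print("ols:  ", beta_ols, "objective:", ridge_objective(X, Y, beta_ols, lam))
```

With a positive regularization parameter the ridge estimate has a smaller Euclidean norm than the least-squares estimate, at the cost of a slightly larger residual sum of squares.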

…

Counter-Example(s):
Bayesian Ridge Regression Task.

LASSO Regression Task.

Basis Pursuit Denoising Task.

Linear Least-Squares Regression Task.

See: Regularization Parameter, Regularization, L1 Regularization, Lp Regularization.

References
2017a
(Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Tikhonov_regularization Retrieved:2017-8-20.
Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization of ill-posed problems. In statistics, the method is known as ridge regression, in machine learning it is known as weight decay, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems.
Suppose that for a known matrix [math]\displaystyle{ A }[/math] and vector [math]\displaystyle{ \mathbf{b} }[/math], we wish to find a vector [math]\displaystyle{ \mathbf{x} }[/math] such that [math]\displaystyle{ A\mathbf{x}=\mathbf{b} }[/math]. The standard approach is ordinary least squares linear regression. However, if no [math]\displaystyle{ \mathbf{x} }[/math] satisfies the equation or more than one [math]\displaystyle{ \mathbf{x} }[/math] does (that is, the solution is not unique), the problem is said to be ill posed. In such cases, ordinary least squares estimation leads to an overdetermined (over-fitted), or more often an underdetermined (under-fitted) system of equations. Most real-world phenomena have the effect of low-pass filters in the forward direction where [math]\displaystyle{ A }[/math] maps [math]\displaystyle{ \mathbf{x} }[/math] to [math]\displaystyle{ \mathbf{b} }[/math]. Therefore, in solving the inverse-problem, the inverse mapping operates as a high-pass filter that has the undesirable tendency of amplifying noise (eigenvalues / singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of [math]\displaystyle{ \mathbf{x} }[/math] that is in the null-space of [math]\displaystyle{ A }[/math], rather than allowing for a model to be used as a prior for [math]\displaystyle{ \mathbf{x} }[/math]. Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as [math]\displaystyle{ \|A\mathbf{x}-\mathbf{b}\|^2 }[/math], where [math]\displaystyle{ \left \| \cdot \right \| }[/math] is the Euclidean norm.
In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization: [math]\displaystyle{ \|A\mathbf{x}-\mathbf{b}\|^2+ \|\Gamma \mathbf{x}\|^2 }[/math] for some suitably chosen Tikhonov matrix, [math]\displaystyle{ \Gamma }[/math]. In many cases, this matrix is chosen as a multiple of the identity matrix ([math]\displaystyle{ \Gamma= \alpha I }[/math]), giving preference to solutions with smaller norms; this is known as L2 regularization. In other cases, lowpass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by [math]\displaystyle{ \hat{x} }[/math], is given by: [math]\displaystyle{ \hat{x} = (A^\top A+ \Gamma^\top \Gamma )^{-1}A^\top\mathbf{b} }[/math]. The effect of regularization may be varied via the scale of matrix [math]\displaystyle{ \Gamma }[/math]. For [math]\displaystyle{ \Gamma = 0 }[/math] this reduces to the unregularized least squares solution provided that [math]\displaystyle{ (A^\top A)^{-1} }[/math] exists. L2 regularization is used in many contexts aside from linear regression, such as classification with logistic regression or support vector machines, and matrix factorization.
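As an illustration of the explicit solution quoted above (not part of the quoted text), the sketch below computes it in two ways for a general Tikhonov matrix: directly from the formula, and as an ordinary least-squares problem on a stacked system, which works because the regularized objective equals the squared norm of the stacked residual. The function names and the random test problem are assumptions.

```python
import numpy as np

def tikhonov_explicit(A, b, Gamma):
    """x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b, computed with a linear solve."""
    return np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

def tikhonov_stacked(A, b, Gamma):
    """Same estimate from least squares on [A; Gamma] x = [b; 0]."""
    A_aug = np.vstack([A, Gamma])
    b_aug = np.concatenate([b, np.zeros(Gamma.shape[0])])
    x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return x

# Random test problem (assumed for the example only).
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 4))
b = rng.normal(size=20)

alpha = 0.5
Gamma = alpha * np.eye(4)   # multiple of the identity: L2 regularization

x_explicit = tikhonov_explicit(A, b, Gamma)
x_stacked = tikhonov_stacked(A, b, Gamma)
print(x_explicit)
print(x_stacked)
```

The stacked formulation avoids forming the product of A with its transpose, which is often preferred numerically when A is poorly conditioned.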
2017b
(Zhang, 2017) ⇒ Xinhua Zhang (2017). "Regularization" in "Encyclopedia of Machine Learning and Data Mining" (Sammut & Webb, 2017) pp 1083-1088 ISBN: 978-1-4899-7687-1, DOI: 10.1007/978-1-4899-7687-1_718
QUOTE: An Illustrative Example: Ridge Regression
Ridge regression is illustrative of the use of regularization. It tries to fit the label [math]\displaystyle{ y }[/math] by a linear model [math]\displaystyle{ \left \langle \mathbf{w},\mathbf{x}\right \rangle }[/math] (inner product). So we need to solve a system of linear equations in [math]\displaystyle{ \mathbf{w} }[/math]: [math]\displaystyle{ (\mathbf{x}_{1},\ldots, \mathbf{x}_{n})^{\top }\mathbf{w} =\mathbf{ y} }[/math], which is equivalent to a linear least square problem: [math]\displaystyle{ \min _{\mathbf{w}\in \mathbb{R}^{p}}\left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2} }[/math]. If the rank of X is less than the dimension of [math]\displaystyle{ \mathbf{w} }[/math], then it is overdetermined and the solution is not unique.

To approach this ill-posed problem, one needs to introduce additional assumptions on what models are preferred, i.e., the regularizer. One choice is to pick a matrix [math]\displaystyle{ \Gamma }[/math] and regularize [math]\displaystyle{ \mathbf{w} }[/math] by [math]\displaystyle{ \left \|\Gamma \mathbf{w}\right \|^{2} }[/math]. As a result we solve [math]\displaystyle{ \min _{\mathbf{w}\in \mathbb{R}^{p}}\left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2} +\lambda \left \|\Gamma ^{\top }\mathbf{w}\right \|^{2} }[/math], and the solution has a closed form [math]\displaystyle{ \mathbf{w}^{{\ast}} = (XX^{\top } +\lambda \Gamma \Gamma ^{\top })^{-1}X\mathbf{y} }[/math]. [math]\displaystyle{ \Gamma }[/math] can be simply the identity matrix which encodes our preference for small norm models.

The use of regularization can also be justified from a Bayesian point of view. Treating [math]\displaystyle{ \mathbf{w} }[/math] as a multivariate random variable and the likelihood as [math]\displaystyle{ \exp \left (-\left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2}\right ) }[/math], then the minimizer of [math]\displaystyle{ \left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2} }[/math] is just a maximum likelihood estimate of [math]\displaystyle{ \mathbf{w} }[/math]. However, we may also assume a prior distribution over [math]\displaystyle{ \mathbf{w} }[/math], e.g., a Gaussian prior [math]\displaystyle{ p(\mathbf{w}) \sim \exp \left (-\lambda \left \|\Gamma ^{\top }\mathbf{w}\right \|^{2}\right ) }[/math]. Then the solution of the ridge regression is simply the maximum a posterior estimate of [math]\displaystyle{ \mathbf{w} }[/math].
2017c
(Stats Stack Exchange, 2017) ⇒ http://stats.stackexchange.com/questions/228763/regularization-methods-for-logistic-regression Retrieved: 2017-08-20.
QUOTE: Yes, Regularization can be used in all linear methods, including both regression and classification. I would like to show you that there are not too much difference between regression and classification: the only difference is the loss function.
Specifically, there are three major components of linear method, Loss Function, Regularization, Algorithms. Where loss function plus regularization is the objective function in the problem in optimization form and the algorithm is the way to solve it (the objective function is convex, we will not discuss in this post).
In loss function setting, we can have different loss in both regression and classification cases. For example, Least squares and least absolute deviation loss can be used for regression. And their math representation are [math]\displaystyle{ L(\hat y,y)=(\hat y -y)^2 }[/math] and [math]\displaystyle{ L(\hat y,y)=|\hat y -y| }[/math]. (The function [math]\displaystyle{ L( \cdot ) }[/math] is defined on two scalar, [math]\displaystyle{ y }[/math] is ground truth value and [math]\displaystyle{ \hat y }[/math] is predicted value.)
On the other hand, logistic loss and hinge loss can be used for classification. Their math representations are [math]\displaystyle{ L(\hat y, y)=\log (1+ \exp(-\hat y y)) }[/math] and [math]\displaystyle{ L(\hat y, y)= (1- \hat y y)_+ }[/math]. (Here, [math]\displaystyle{ y }[/math] is the ground truth label in [math]\displaystyle{ \{-1,1\} }[/math] and [math]\displaystyle{ \hat y }[/math] is predicted "score". The definition of [math]\displaystyle{ \hat y }[/math] is a little bit unusual, please see the comment section.)
In regularization setting, you mentioned about the L1 and L2 regularization, there are also other forms, which will not be discussed in this post.
Therefore, in a high level a linear method is
[math]\displaystyle{ \underset{w}{\text{minimize}}~~~ \sum_{x,y} L(w^{\top} x,y)+\lambda h(w) }[/math]
If you replace the Loss function from regression setting to logistic loss, you get the logistic regression with regularization.
For example, in ridge regression, the optimization problem is
[math]\displaystyle{ \underset{w}{\text{minimize}}~~~ \sum_{x,y} (w^{\top} x-y)^2+\lambda w^\top w }[/math]
If you replace the loss function with logistic loss, the problem becomes
[math]\displaystyle{ \underset{w}{\text{minimize}}~~~ \sum_{x,y} \log(1+\exp{(-w^{\top}x \cdot y)})+\lambda w^\top w }[/math]
Here you have the logistic regression with L2 regularization.
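To make the last objective concrete (this sketch is an illustration, not part of the quoted answer), the code below minimizes the logistic loss plus the L2 penalty with plain gradient descent. The step size, iteration count, helper names, and synthetic labels in {-1, 1} are all assumptions chosen so that the small example converges.

```python
import numpy as np

def objective(w, X, y, lam):
    """sum log(1 + exp(-y * w^T x)) + lam * w^T w."""
    margins = y * (X @ w)
    return float(np.sum(np.logaddexp(0.0, -margins)) + lam * (w @ w))

def gradient(w, X, y, lam):
    margins = y * (X @ w)
    # Derivative of log(1 + exp(-m)) with respect to m is -1 / (1 + exp(m)).
    coeff = -y / (1.0 + np.exp(margins))
    return X.T @ coeff + 2.0 * lam * w

def fit(X, y, lam, step=0.01, iters=2000):
    """Fixed-step gradient descent (illustrative, not tuned)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= step * gradient(w, X, y, lam)
    return w

# Synthetic two-feature data with labels in {-1, +1} (assumed for the example).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
scores = X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=100)
y = np.where(scores > 0, 1.0, -1.0)

lam = 1.0
w_hat = fit(X, y, lam)
print("weights:", w_hat, "objective:", objective(w_hat, X, y, lam))
```

Replacing the logistic loss in `objective` and `gradient` with the squared loss turns the same loop into a gradient-descent solver for ridge regression, which mirrors the point of the quoted answer that only the loss function changes.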
2017d
(Quadrianto & Buntine, 2017) ⇒ Novi Quadrianto, Wray L. Buntine (2017). "Linear Regression" in "Encyclopedia of Machine Learning and Data Mining (2017)" pp 747-750 DOI:10.1007/978-1-4899-7687-1_481 ISBN: 978-1-4899-7687-1.
QUOTE: Regularized/Penalized Least Squares Method
The issue of over-fitting as mentioned in Regression is usually addressed by introducing a regularization or penalty term to the objective function. The regularized objective function is now in the form of [math]\displaystyle{ E_{\mathrm{reg}} = E(w) +\lambda R(w) }[/math] (9). Here [math]\displaystyle{ E(w) }[/math] measures the quality (such as least squares quality) of the solution on the observed data points, [math]\displaystyle{ R(w) }[/math] penalizes complex solutions, and [math]\displaystyle{ \lambda }[/math] is called the regularization parameter which controls the relative importance between the two. This regularized formulation is sometimes called coefficient shrinkage as it shrinks coefficients/weights toward zero (cf. coefficient subset selection formulation where the best [math]\displaystyle{ k }[/math] out of [math]\displaystyle{ H }[/math] basis functions are greedily selected). Two simple penalty terms [math]\displaystyle{ R(w) }[/math] are given next, but more generally measures of curvature can also be used to penalize non-smooth functions.
Ridge Regression
The regularization term is in the form of [math]\displaystyle{ R(w) =\sum _{ d=1}^{D}w_{ d}^{2} }[/math] (10). Considering [math]\displaystyle{ E(w) }[/math] to be in the form of (1), the regularized least squares quality function is now [math]\displaystyle{ (Xw - y)^{T}(Xw - y) +\lambda w^{T}w }[/math] (11).
Since the additional term is a quadratic of [math]\displaystyle{ w }[/math], the regularized objective function is still quadratic in [math]\displaystyle{ w }[/math], thus the optimal solution is unique and can be found in closed form. As before, setting the first derivative of (11) with respect to [math]\displaystyle{ w }[/math] to zero, the optimal weight vector is in the form of [math]\displaystyle{ \partial _{w}E_{\mathrm{reg}}(w) = 2X^{T}(Xw - y) + 2\lambda w = 0 }[/math] (12), [math]\displaystyle{ w^{\ast} = (X^{T}X +\lambda I)^{-1}X^{T}y }[/math] (13).
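The stationarity condition (12) and the closed form (13) can be checked numerically. The sketch below, an illustration under assumed random data rather than part of the quoted text, verifies that the closed-form weights make the gradient vanish, and that the solution remains well defined when a column of X is duplicated so that the unregularized normal equations are singular.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """w* = (X^T X + lam I)^{-1} X^T y, as in equation (13)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_gradient(X, y, w, lam):
    """Left-hand side of equation (12)."""
    return 2.0 * X.T @ (X @ w - y) + 2.0 * lam * w

# Random data (assumed for the example only).
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 0.7

w_star = ridge_closed_form(X, y, lam)
grad = ridge_gradient(X, y, w_star, lam)
print("max |gradient| at w*:", np.abs(grad).max())

# Duplicate the first column: X_dup^T X_dup is singular, yet the ridge system is not.
X_dup = np.hstack([X, X[:, :1]])
w_dup = ridge_closed_form(X_dup, y, lam)
print("rank of X_dup^T X_dup:", np.linalg.matrix_rank(X_dup.T @ X_dup))
print("weights on the duplicated columns:", w_dup[0], w_dup[5])
```

In the duplicated-column case the two identical predictors receive equal weights, which is the unique minimizer singled out by the penalty.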
The effect of the regularization term is to put a small weight for those basis functions which are useful only in a minor way as the penalty for small weights is very small.
2017e
(Scikit-Learn, 2017) ⇒ "1.1.2. Ridge Regression" http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
QUOTE: 1.1.2. Ridge Regression
Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares,
[math]\displaystyle{ \underset{w}{\min} \; \|Xw - y\|_2^2 + \alpha \|w\|_2^2 }[/math]
Here, [math]\displaystyle{ \alpha \geq 0 }[/math] is a complexity parameter that controls the amount of shrinkage: the larger the value of [math]\displaystyle{ \alpha }[/math], the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.
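The two effects described above, shrinkage and robustness to collinearity, can be observed on a small synthetic problem. The sketch below is an illustration only: it solves the penalized objective directly with NumPy instead of calling the scikit-learn estimator, and the nearly collinear design, noise level, and helper names are assumptions.

```python
import numpy as np

def ridge_coef(X, y, alpha):
    """Minimizer of ||Xw - y||^2 + alpha * ||w||^2 (no intercept term)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
true_w = np.array([1.0, 1.0])

# Shrinkage: the coefficient norm decreases as alpha grows.
y = X @ true_w + 0.1 * rng.normal(size=n)
alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_coef(X, y, a))) for a in alphas]
print(dict(zip(alphas, norms)))

def coef_spread(alpha, trials=200):
    """Total variance of the fitted coefficients over fresh noise draws."""
    coefs = np.array([
        ridge_coef(X, X @ true_w + 0.1 * rng.normal(size=n), alpha)
        for _ in range(trials)
    ])
    return float(coefs.var(axis=0).sum())

# Robustness: with collinear columns, unpenalized coefficients vary far more.
spread_ols = coef_spread(0.0)
spread_ridge = coef_spread(1.0)
print("spread without penalty:", spread_ols, "with alpha=1:", spread_ridge)
```

The unpenalized fit can trade weight between the two nearly identical columns almost freely, so its coefficients swing with the noise; the penalty removes most of that freedom.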