# Linear Least-Squares L2-Regularized Regression Algorithm

A Linear Least-Squares L2-Regularized Regression Algorithm is a regularized regression algorithm that can be implemented by an L2-Regularized Optimization System to solve a linear least-squares L2-regularized regression task.

**AKA:** ℓ2 Ridge Regression Method, Tikhonov Regularization Technique, Ridge Regression Algorithm, Phillips-Twomey Algorithm.

**Context:**
- ...

**Example(s):**

**Counter-Example(s):**

**See:** L2-Norm Regularizer, Regularization, Supervised Estimation Algorithm, Regularized Supervised Learning Algorithm, Parameter Shrinkage, Support Vector Machine, Non-Linear Least Squares.

## References

### 2017

- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Tikhonov_regularization Retrieved: 2017-08-13.
**Tikhonov regularization**, named for Andrey Tikhonov, is the most commonly used method of regularization of ill-posed problems. In statistics, the method is known as **ridge regression**; in machine learning it is known as **weight decay**; and with multiple independent discoveries, it is also variously known as the **Tikhonov–Miller method**, the **Phillips–Twomey method**, the **constrained linear inversion** method, and the method of **linear regularization**. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems. Suppose that for a known matrix [math] A [/math] and vector [math] \mathbf{b} [/math], we wish to find a vector [math] \mathbf{x} [/math] such that: [math] A\mathbf{x}=\mathbf{b} [/math]. The standard approach is ordinary least squares linear regression. However, if no [math] \mathbf{x} [/math] satisfies the equation, or more than one [math] \mathbf{x} [/math] does (that is, the solution is not unique), the problem is said to be ill-posed. In such cases, ordinary least squares estimation leads to an overdetermined (over-fitted), or more often an underdetermined (under-fitted), system of equations. Most real-world phenomena have the effect of low-pass filters in the forward direction where [math] A [/math] maps [math] \mathbf{x} [/math] to [math] \mathbf{b} [/math]. Therefore, in solving the inverse problem, the inverse mapping operates as a high-pass filter that has the undesirable tendency of amplifying noise (eigenvalues / singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of [math] \mathbf{x} [/math] that is in the null-space of [math] A [/math], rather than allowing for a model to be used as a prior for [math] \mathbf{x} [/math].
Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as: [math] \|A\mathbf{x}-\mathbf{b}\|^2 [/math] where [math] \left\| \cdot \right\| [/math] is the Euclidean norm. In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization: [math] \|A\mathbf{x}-\mathbf{b}\|^2 + \|\Gamma \mathbf{x}\|^2 [/math] for some suitably chosen **Tikhonov matrix**, [math] \Gamma [/math]. In many cases, this matrix is chosen as a multiple of the identity matrix ([math] \Gamma = \alpha I [/math]), giving preference to solutions with smaller norms; this is known as **ridge regression**. In other cases, lowpass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by [math] \hat{x} [/math], is given by: [math] \hat{x} = (A^\top A + \Gamma^\top \Gamma)^{-1} A^\top \mathbf{b} [/math]. The effect of regularization may be varied via the scale of matrix [math] \Gamma [/math]. For [math] \Gamma = 0 [/math] this reduces to the unregularized least squares solution, provided that [math] (A^\top A)^{-1} [/math] exists. [math] L_2 [/math] regularization is used in many contexts aside from linear regression, such as classification with logistic regression or support vector machines, and matrix factorization.
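The closed form above is easy to check numerically. The following is a minimal sketch (the matrices `A`, `b` and the choice `alpha = 0.5` are my own illustrative random data, not from the quoted text) of solving the Tikhonov-regularized problem and confirming that the [math] \Gamma = \alpha I [/math] solution has a smaller norm than ordinary least squares:

```python
import numpy as np

# Illustrative random problem (assumed data, not from the source text).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

alpha = 0.5
Gamma = alpha * np.eye(5)  # Gamma = alpha * I: the ridge-regression choice

# Closed form: x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b,
# solved via a linear system rather than an explicit inverse.
x_hat = np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

# With Gamma = 0 this reduces to the ordinary least-squares solution.
x_ols, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_ols, np.linalg.solve(A.T @ A, A.T @ b))

# Regularization shrinks the solution toward smaller norm.
assert np.linalg.norm(x_hat) < np.linalg.norm(x_ols)
```

Solving the normal equations with `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically safer route; for ill-conditioned [math] A [/math] an SVD-based solver would be preferable still.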

### 2011a

- (Quadrianto & Buntine, 2011) ⇒ Novi Quadrianto and Wray L. Buntine (2011). "Linear Regression". In: (Sammut & Webb, 2011), pp. 747-750.
- QUOTE: **Ridge Regression** - The regularization term is in the form of

[math]R(w)=\sum_{d=1}^D w^2_d \quad\quad[/math] (10)

Considering [math]E(w)[/math] to be in the form of (1), the regularized least squares quality function is now

[math](Xw-y)^T(Xw-y)+\lambda w^Tw \quad\quad[/math] (11)

Since the additional term is a quadratic of [math]w[/math], the regularized objective function is still quadratic in [math]w[/math], thus the optimal solution is unique and can be found in closed form. As before, setting the first derivative of (11) with respect to [math]w[/math] to zero, the optimal weight vector is in the form of

[math]\partial_w E_{reg}(w)=2X^T(Xw-y)+2\lambda w=0\quad\quad[/math] (12)

[math]w^*=(X^TX+\lambda I)^{-1}X^Ty\quad\quad[/math] (13)

The effect of the regularization term is to assign only small weights to those basis functions that are useful in a minor way, since the penalty on small weights is very small.
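The derivation above can be verified numerically: at the closed-form weights (13), the gradient of the regularized objective (11) vanishes. A minimal sketch (the design matrix `X`, targets `y`, and `lam = 0.1` are my own illustrative random data, not from the quoted text):

```python
import numpy as np

# Illustrative random regression problem (assumed data).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 0.1

# Closed-form ridge weights: w* = (X^T X + lambda I)^{-1} X^T y.
w_star = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# The gradient of the objective (Xw - y)^T (Xw - y) + lambda w^T w
# is 2 X^T (Xw - y) + 2 lambda w; it must vanish at w*.
grad = 2 * X.T @ (X @ w_star - y) + 2 * lam * w_star
assert np.allclose(grad, 0)

# The ridge objective at w* is no larger than at the unregularized solution.
def objective(w):
    r = X @ w - y
    return r @ r + lam * w @ w

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
assert objective(w_star) <= objective(w_ols)
```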


### 2011b

- (Zhang, 2017) ⇒ Xinhua Zhang (2017). "Regularization". In: "Encyclopedia of Machine Learning and Data Mining" (Sammut & Webb, 2017), Springer US, Boston MA, pp. 1083-1088. ISBN:978-1-4899-7687-1, DOI:10.1007/978-1-4899-7687-1_718
- QUOTE: Ridge regression is illustrative of the use of regularization. It tries to fit the label [math]y[/math] by a linear model [math] \langle w,x \rangle [/math] (inner product). So we need to solve a system of linear equations in [math]w[/math]: [math](x_1,\cdots, x_n)^T w= y[/math], which is equivalent to a linear least-squares problem: [math]\min_{w\in\mathbb{R}^p} \parallel X^Tw-y \parallel^2[/math]. If the rank of [math]X[/math] is less than the dimension of [math]w[/math], then the system is underdetermined and the solution is not unique.
To approach this ill-posed problem, one needs to introduce additional assumptions on what models are preferred, i.e., the regularizer. One choice is to pick a matrix [math]\Gamma[/math] and regularize [math]w[/math] by [math]\parallel \Gamma^T w \parallel[/math]. As a result we solve [math]\min_{w\in\mathbb{R}^p} \parallel X^Tw-y \parallel^2+\lambda \parallel \Gamma^T w\parallel^2[/math], and the solution has a closed form [math]w^*=(XX^T+\lambda \Gamma\Gamma^T)^{-1}Xy[/math]. Here [math]\Gamma[/math] can be simply the identity matrix, which encodes our preference for small-norm models. The use of regularization can also be justified from a Bayesian point of view. Treating [math]y[/math] as a multivariate random variable and the likelihood as [math]\exp\left(-\parallel X^Tw-y \parallel^2\right)[/math], the minimizer of [math]\parallel X^Tw-y \parallel^2[/math] is just a maximum likelihood estimate of [math]w[/math]. However, we may also assume a prior distribution over [math]w[/math], e.g., a Gaussian prior [math]\exp\left(-\lambda\parallel w\parallel^2\right)[/math]. Then the solution of the ridge regression is simply the maximum a posteriori estimate of [math]w[/math].
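Note that Zhang stores one example per *column* of [math]X[/math], so his closed form uses [math]XX^T[/math] where the earlier row-convention formula (13) uses [math]X^TX[/math]. A minimal sketch (random assumed data, not from the quoted text) confirming that the two conventions give the same weights:

```python
import numpy as np

# Illustrative random data (assumed). Zhang's convention: X is p x n,
# one example per column, so the fit is min ||X^T w - y||^2 + lam ||w||^2.
rng = np.random.default_rng(2)
p, n = 3, 10
X = rng.normal(size=(p, n))   # columns x_1 .. x_n
y = rng.normal(size=n)
lam = 0.5
Gamma = np.eye(p)             # identity: plain small-norm (ridge) preference

# Zhang's closed form: w* = (X X^T + lam * Gamma Gamma^T)^{-1} X y.
w_star = np.linalg.solve(X @ X.T + lam * Gamma @ Gamma.T, X @ y)

# Row convention: X^T plays the role of the design matrix A (n x p),
# giving w* = (A^T A + lam I)^{-1} A^T y, as in equation (13) style.
A = X.T
w_rows = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)
assert np.allclose(w_star, w_rows)
```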
