# Linear Least-Squares L2-Regularized Regression Task

A Linear Least-Squares L2-Regularized Regression Task is a regularized linear least-squares regression task whose regularization term is the squared L2 norm of the coefficient vector.

## References

### 2017a

• (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Tikhonov_regularization Retrieved:2017-8-20.
• Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization of ill-posed problems. In statistics, the method is known as ridge regression, in machine learning it is known as weight decay, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems.

Suppose that for a known matrix $A$ and vector $\mathbf{b}$, we wish to find a vector $\mathbf{x}$ such that:

$A\mathbf{x}=\mathbf{b}$

The standard approach is ordinary least squares linear regression. However, if no $\mathbf{x}$ satisfies the equation or more than one $\mathbf{x}$ does (that is, the solution is not unique), the problem is said to be ill posed. In such cases, ordinary least squares estimation leads to an overdetermined (over-fitted), or more often an underdetermined (under-fitted) system of equations. Most real-world phenomena have the effect of low-pass filters in the forward direction where $A$ maps $\mathbf{x}$ to $\mathbf{b}$. Therefore, in solving the inverse-problem, the inverse mapping operates as a high-pass filter that has the undesirable tendency of amplifying noise (eigenvalues / singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of $\mathbf{x}$ that is in the null-space of $A$, rather than allowing for a model to be used as a prior for $\mathbf{x}$.

Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as:

$\|A\mathbf{x}-\mathbf{b}\|^2$

where $\left \| \cdot \right \|$ is the Euclidean norm. In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization:

$\|A\mathbf{x}-\mathbf{b}\|^2+ \|\Gamma \mathbf{x}\|^2$

for some suitably chosen Tikhonov matrix, $\Gamma$. In many cases, this matrix is chosen as a multiple of the identity matrix ($\Gamma= \alpha I$), giving preference to solutions with smaller norms; this is known as L2 regularization. In other cases, lowpass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous.

This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by $\hat{x}$, is given by:

$\hat{x} = (A^\top A+ \Gamma^\top \Gamma )^{-1}A^\top\mathbf{b}$

The effect of regularization may be varied via the scale of matrix $\Gamma$. For $\Gamma = 0$ this reduces to the unregularized least squares solution provided that $(A^\top A)^{-1}$ exists. L2 regularization is used in many contexts aside from linear regression, such as classification with logistic regression or support vector machines, and matrix factorization.
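The explicit solution $\hat{x} = (A^\top A+ \Gamma^\top \Gamma )^{-1}A^\top\mathbf{b}$ can be checked numerically. The following is a minimal NumPy sketch, not part of the quoted source: the function name `tikhonov_solve`, the random test data, and the value `alpha = 0.5` are illustrative assumptions. It also shows the equivalent view of the regularized problem as an ordinary least-squares problem on an augmented system.

```python
import numpy as np

def tikhonov_solve(A, b, Gamma):
    """Return x minimizing ||A x - b||^2 + ||Gamma x||^2 (hypothetical helper).

    Solves the normal equations (A^T A + Gamma^T Gamma) x = A^T b
    with a linear solve instead of forming an explicit inverse.
    """
    lhs = A.T @ A + Gamma.T @ Gamma
    rhs = A.T @ b
    return np.linalg.solve(lhs, rhs)

# Illustrative data (assumed, not from the source)
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

alpha = 0.5                 # assumed regularization scale
Gamma = alpha * np.eye(5)   # Gamma = alpha * I, i.e. L2 regularization

x_hat = tikhonov_solve(A, b, Gamma)

# Equivalent formulation: stack Gamma under A and zeros under b,
# then solve the augmented system by ordinary least squares.
A_aug = np.vstack([A, Gamma])
b_aug = np.concatenate([b, np.zeros(5)])
x_aug, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

print(np.allclose(x_hat, x_aug))
```

The augmented form is often preferred in practice because it avoids forming $A^\top A$, which squares the condition number of $A$.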

### 2017b

• (Zhang, 2017) ⇒ Xinhua Zhang (2017). “Regularization” in “Encyclopedia of Machine Learning and Data Mining” (Sammut & Webb, 2017) pp. 1083-1088 ISBN: 978-1-4899-7687-1, DOI: 10.1007/978-1-4899-7687-1_718
• QUOTE: An Illustrative Example: Ridge Regression

Ridge regression is illustrative of the use of regularization. It tries to fit the label $y$ by a linear model $\left \langle \mathbf{w},\mathbf{x}\right \rangle$ (inner product). So we need to solve a system of linear equations in $\mathbf{w}$: $(\mathbf{x}_{1},\ldots, \mathbf{x}_{n})^{\top }\mathbf{w} =\mathbf{ y}$, which is equivalent to a linear least squares problem: $\min _{\mathbf{w}\in \mathbb{R}^{p}}\left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2}$. If the rank of $X$ is less than the dimension of $\mathbf{w}$, then it is underdetermined and the solution is not unique.

To approach this ill-posed problem, one needs to introduce additional assumptions on what models are preferred, i.e., the regularizer. One choice is to pick a matrix $\Gamma$ and regularize $\mathbf{w}$ by $\left \|\Gamma ^{\top }\mathbf{w}\right \|^{2}$. As a result we solve $\min _{\mathbf{w}\in \mathbb{R}^{p}}\left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2} +\lambda \left \|\Gamma ^{\top }\mathbf{w}\right \|^{2}$, and the solution has a closed form $\mathbf{w}^{{\ast}} = (XX^{\top } +\lambda \Gamma \Gamma ^{\top })^{-1}X\mathbf{y}$. $\Gamma$ can be simply the identity matrix which encodes our preference for small norm models.
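In this notation the examples are the columns of $X$ (so $X$ is $p \times n$), and setting the gradient of the regularized objective to zero gives $(XX^{\top } +\lambda \Gamma \Gamma ^{\top })\mathbf{w} = X\mathbf{y}$. A minimal NumPy sketch of that closed form follows; the function name `ridge_closed_form`, the random data, and `lam=0.1` are assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam, Gamma=None):
    """Solve (X X^T + lam * Gamma Gamma^T) w = X y (hypothetical helper).

    X has shape (p, n): each column is one example.
    Gamma defaults to the identity, i.e. a preference for small-norm w.
    """
    p = X.shape[0]
    if Gamma is None:
        Gamma = np.eye(p)
    return np.linalg.solve(X @ X.T + lam * (Gamma @ Gamma.T), X @ y)

# Illustrative data with fewer examples than features, so rank(X) < p
# and the unregularized problem has no unique solution.
rng = np.random.default_rng(1)
p, n = 6, 4
X = rng.normal(size=(p, n))
y = rng.normal(size=n)

w_star = ridge_closed_form(X, y, lam=0.1)
print(np.linalg.matrix_rank(X), w_star.shape)
```

With $\lambda > 0$ and $\Gamma = I$, the matrix $XX^{\top } + \lambda I$ is positive definite even though $XX^{\top }$ alone is singular here, which is why the solve succeeds.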

The use of regularization can also be justified from a Bayesian point of view. Treating $\mathbf{w}$ as a multivariate random variable and the likelihood as $\exp \left (-\left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2}\right )$, the minimizer of $\left \|X^{\top }\mathbf{w} -\mathbf{ y}\right \|^{2}$ is just a maximum likelihood estimate of $\mathbf{w}$. However, we may also assume a prior distribution over $\mathbf{w}$, e.g., a Gaussian prior $p(\mathbf{w}) \sim \exp \left (-\lambda \left \|\Gamma ^{\top }\mathbf{w}\right \|^{2}\right )$. Then the solution of the ridge regression is simply the maximum a posteriori estimate of $\mathbf{w}$.
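The step from the prior and likelihood to the ridge objective can be written out explicitly. The derivation below is a sketch using only the likelihood and Gaussian prior stated above; "const" collects the normalizing terms that do not depend on $\mathbf{w}$.

```latex
\begin{align*}
p(\mathbf{w} \mid \mathbf{y})
  &\propto p(\mathbf{y} \mid \mathbf{w})\, p(\mathbf{w})
   \propto \exp\left(-\left\|X^{\top}\mathbf{w} - \mathbf{y}\right\|^{2}\right)
           \exp\left(-\lambda \left\|\Gamma^{\top}\mathbf{w}\right\|^{2}\right), \\
-\log p(\mathbf{w} \mid \mathbf{y})
  &= \left\|X^{\top}\mathbf{w} - \mathbf{y}\right\|^{2}
     + \lambda \left\|\Gamma^{\top}\mathbf{w}\right\|^{2} + \text{const}, \\
\arg\max_{\mathbf{w}}\; p(\mathbf{w} \mid \mathbf{y})
  &= \arg\min_{\mathbf{w}}\; \left\|X^{\top}\mathbf{w} - \mathbf{y}\right\|^{2}
     + \lambda \left\|\Gamma^{\top}\mathbf{w}\right\|^{2}.
\end{align*}
```

Since the logarithm is monotone, maximizing the posterior is the same as minimizing its negative log, which is exactly the regularized least-squares objective.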

### 2017e

• (Scikit-Learn, 2017) ⇒ "1.1.2. Ridge Regression" http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
• QUOTE: 1.1.2. Ridge Regression

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares:

$\underset{w}{\min\,} \|Xw - y\|_2^2 + \alpha \|w\|_2^2$

Here, $\alpha \geq 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\alpha$, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.
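The shrinkage effect can be illustrated with a small NumPy sketch on two nearly collinear features. This is not scikit-learn's implementation: it uses the closed form $w = (X^\top X + \alpha I)^{-1} X^\top y$ for the penalized objective $\|Xw - y\|_2^2 + \alpha \|w\|_2^2$ without an intercept, and the helper name `ridge_coefficients`, the synthetic data, and the grid of `alphas` are assumptions chosen for illustration.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form minimizer of ||X w - y||^2 + alpha * ||w||^2 (hypothetical helper).

    X has shape (n_samples, n_features); no intercept is fitted.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data with two nearly collinear columns (assumed for illustration)
rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = x1 + 1e-3 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=50)

alphas = [1e-4, 1e-2, 1.0, 100.0]
norms = []
for alpha in alphas:
    w = ridge_coefficients(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:g}  w={w}  ||w||={norms[-1]:.4f}")
```

As `alpha` increases, the coefficient norm decreases, and the two coefficients on the near-duplicate features move toward similar values instead of large offsetting ones. Note that scikit-learn's `Ridge` estimator fits an intercept by default, so its coefficients will generally differ from this intercept-free sketch unless the data are centered or `fit_intercept=False` is used.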