Linear Least-Squares L2-Regularized Regression Algorithm

From GM-RKB
(Redirected from Ridge regression)
Jump to navigation Jump to search

An Linear Least-Squares L2-Regularized Regression Algorithm is a regularized regression algorithm can be implemented by a L2-Regularized Optimization System to solve an linear least-squares l2-regularized regression task)



References

2017

  • (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Tikhonov_regularization Retrieved:2017-8-13.
    • Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization of ill-posed problems. In statistics, the method is known as ridge regression, in machine learning it is known as weight decay, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems.

      Suppose that for a known matrix [math]\displaystyle{ A }[/math] and vector [math]\displaystyle{ \mathbf{b} }[/math] , we wish to find a vector [math]\displaystyle{ \mathbf{x} }[/math] such that: : [math]\displaystyle{ A\mathbf{x}=\mathbf{b} }[/math] The standard approach is ordinary least squares linear regression. However, if no [math]\displaystyle{ \mathbf{x} }[/math] satisfies the equation or more than one [math]\displaystyle{ \mathbf{x} }[/math] does — that is, the solution is not unique — the problem is said to be ill posed. In such cases, ordinary least squares estimation leads to an overdetermined (over-fitted), or more often an underdetermined (under-fitted) system of equations. Most real-world phenomena have the effect of low-pass filters in the forward direction where [math]\displaystyle{ A }[/math] maps [math]\displaystyle{ \mathbf{x} }[/math] to [math]\displaystyle{ \mathbf{b} }[/math] . Therefore, in solving the inverse-problem, the inverse mapping operates as a high-pass filter that has the undesirable tendency of amplifying noise (eigenvalues / singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of [math]\displaystyle{ \mathbf{x} }[/math] that is in the null-space of [math]\displaystyle{ A }[/math] , rather than allowing for a model to be used as a prior for [math]\displaystyle{ \mathbf{x} }[/math] . Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as: : [math]\displaystyle{ \|A\mathbf{x}-\mathbf{b}\|^2 }[/math] where [math]\displaystyle{ \left \| \cdot \right \| }[/math] is the Euclidean norm. In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization: : [math]\displaystyle{ \|A\mathbf{x}-\mathbf{b}\|^2+ \|\Gamma \mathbf{x}\|^2 }[/math] for some suitably chosen Tikhonov matrix, [math]\displaystyle{ \Gamma }[/math] . In many cases, this matrix is chosen as a multiple of the identity matrix ([math]\displaystyle{ \Gamma= \alpha I }[/math] ), giving preference to solutions with smaller norms; this is known as L2 regularization. In other cases, lowpass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by [math]\displaystyle{ \hat{x} }[/math] , is given by: : [math]\displaystyle{ \hat{x} = (A^\top A+ \Gamma^\top \Gamma )^{-1}A^\top\mathbf{b} }[/math] The effect of regularization may be varied via the scale of matrix [math]\displaystyle{ \Gamma }[/math] . For [math]\displaystyle{ \Gamma = 0 }[/math] this reduces to the unregularized least squares solution provided that (ATA)−1 exists. L2 regularization is used in many contexts aside from linear regression, such as classification with logistic regression or support vector machines, and matrix factorization.

2011a =

  • (Quadrianto & Buntine, 2011) ⇒ Novi Quadrianto and Wray L. Buntine (2011). "Linear Regression" In: (Sammut & Webb, 2011) pp 747-750.
    • QUOTE: Ridge Regression - The regularization term is in the form of

      [math]\displaystyle{ R(w)=\sum_{d=1}^Dw^2_d \quad\quad }[/math](10)

      Considering [math]\displaystyle{ E(w) }[/math] to be in the form of (1), the regularized least squares quality function is now

      [math]\displaystyle{ (Xw−y)^T(Xw−y)+λw^Tw \quad\quad }[/math](11)

      Since the additional term is a quadratic of [math]\displaystyle{ w }[/math], the regularized objective function is still quadratic in [math]\displaystyle{ w }[/math], thus the optimal solution is unique and can be found in closed form. As before, setting the first derivative of (11) with respect to [math]\displaystyle{ w }[/math] to zero, the optimal weight vector is in the form of

      [math]\displaystyle{ \partial wEreg(w)=2X^T(Xw−y)+\lambda w=0\quad\quad }[/math](12)

      [math]\displaystyle{ w∗=(X^TX+\lambda I)^{−1}X^Ty\quad\quad }[/math](13)

      The effect of the regularization term is to put a small weight for those basis functions which are useful only in a minor way as the penalty for small weights is very small.

2011b

  • (Zhang) ⇒ Xinhua Zhang (2017)"Regularization" In: "Encyclopedia of Machine Learning and Data Mining" (Sammut & Webb, 2017), Springer US, Boston MA, pp 1083-1088. [ISBN:978-1-4899-7687-1], DOI:10.1007/978-1-4899-7687-1_718
    • QUOTE: Ridge regression is illustrative of the use of regularization. It tries to fit the label [math]\displaystyle{ y }[/math] by a linear model [math]\displaystyle{ \langle w,x \rangle }[/math] (inner product). So we need to solve a system of linear equations in [math]\displaystyle{ w: (x_1,\cdots, x_n)^T w= y }[/math], which is equivalent to a linear least square problem: [math]\displaystyle{ min_{w\in\mathcal{R}^p} \parallel X^Tw-y \parallel^2 }[/math]. If the rank of [math]\displaystyle{ X }[/math] is less than the dimension of [math]\displaystyle{ w }[/math], then it is overdetermined and the solution is not unique.

      To approach this ill-posed problem, one needs to introduce additional assumptions on what models are preferred, i.e., the regularizer. One choice is to pick a matrix [math]\displaystyle{ \Gamma }[/math] and regularize [math]\displaystyle{ w }[/math] by [math]\displaystyle{ \parallel \Gamma w \parallel }[/math]. As a result we solve [math]\displaystyle{ min_{w\in\mathcal{R}^p} \parallel X^Tw-y \parallel ^2+\lambda \parallel \Gamma^T w\parallel^2 }[/math], and the solution has a closed form [math]\displaystyle{ w^*=(XX^T+\lambda \Gamma\Gamma^T)Xy }[/math] can be simply the identity matrix which encodes our preference for small norm models. The use of regularization can also be justified from a Bayesian point of view. Treating [math]\displaystyle{ \exp\left(-\parallel X^TW-y \parallel^2\right) }[/math] as a multivariate random variable and the likelihood as [math]\displaystyle{ }[/math], then the minimizer of [math]\displaystyle{ \parallel X^TW-y \parallel^2 }[/math] is just a maximum likelihood estimate of [math]\displaystyle{ w }[/math]. However, we may also assume a prior distribution over [math]\displaystyle{ w }[/math], e.g., a Gaussian prior [math]\displaystyle{ \exp\left(-\parallel\lambda^TW\parallel^2\right) }[/math]. Then the solution of the ridge regression is simply the maximum a posterior estimate of [math]\displaystyle{ w }[/math].