Regularized Learning Algorithm
A Regularized Learning Algorithm is a supervised learning algorithm that optimizes a penalized objective function (a loss function augmented with a regularization penalty term).
- AKA: Regularization (Mathematics).
- Context:
- It can range from being an L1 Regularized Supervised Learning Algorithm to being an L2 Regularized Supervised Learning Algorithm.
- See: Lasso, Support Vector Machine, Gradient Descent Algorithm, Elastic Net Regularization, Overfitting, Loss Function, Cross-Validation (Statistics), Hyperparameter Optimization, Ridge Regression, Least Squares Lasso Method, Multinomial Logistic Regression, Logistic Regression.
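The L₁-versus-L₂ range described in the context above can be sketched as a penalized objective. The following is a minimal NumPy illustration, not from the source; the function name and toy data are hypothetical:

```python
import numpy as np

# Hypothetical sketch: the same squared-error loss E(X, y) combined with
# either an L1 or a squared-L2 penalty, as in E(X, y) + alpha * ||w||.
def penalized_loss(w, X, y, alpha, norm="l2"):
    residual = X @ w - y
    loss = residual @ residual           # E(X, y): sum of squared errors
    if norm == "l1":
        penalty = np.abs(w).sum()        # L1 norm -> sparse solutions (lasso)
    else:
        penalty = w @ w                  # squared L2 norm (ridge / weight decay)
    return loss + alpha * penalty
```

With `alpha = 0` this reduces to the unpenalized loss; increasing `alpha` trades data fit against model complexity, which is the knob that is typically tuned by cross-validation.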
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Regularization_(mathematics)#Regularization_in_statistics_and_machine_learning Retrieved:2015-5-11.
- In statistics and machine learning, regularization methods are used for model selection, in particular to prevent overfitting by penalizing models with extreme parameter values. The most common variants in machine learning are L₁ and L₂ regularization, which can be added to learning algorithms that minimize a loss function E(X, Y) by instead minimizing E(X, Y) + α‖w‖, where w is the model's weight vector, ‖·‖ is either the L₁ norm or the squared L₂ norm, and α is a free parameter that needs to be tuned empirically (typically by cross-validation; see hyperparameter optimization). This method applies to many models. When applied in linear regression, the resulting models are termed ridge regression or lasso, but regularization is also employed in (binary and multiclass) logistic regression, neural nets, support vector machines, conditional random fields and some matrix decomposition methods. L₂ regularization may also be called "weight decay", in particular in the setting of neural nets.
L₁ regularization is often preferred because it produces sparse models and thus performs feature selection within the learning algorithm, but since the L₁ norm is not differentiable, it may require changes to learning algorithms, in particular gradient-based learners. Bayesian learning methods make use of a prior probability that (usually) gives lower probability to more complex models. Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods of controlling overfitting not involving regularization include cross-validation. Regularization can also be used to fine-tune model complexity via an augmented error function evaluated with cross-validation: as model complexity grows, the training-set error keeps decreasing while the validation-set error levels off. Regularization introduces a second term that weights a penalty against more complex models, and this penalty increases with model complexity. Examples of applications of different methods of regularization to the linear model are:

| Model | Fit measure | Entropy measure |
|---|---|---|
| AIC/BIC | [math]\displaystyle{ \|Y-X\beta\|_2 }[/math] | [math]\displaystyle{ \|\beta\|_0 }[/math] |
| Ridge regression | [math]\displaystyle{ \|Y-X\beta\|_2 }[/math] | [math]\displaystyle{ \|\beta\|_2 }[/math] |
| Lasso | [math]\displaystyle{ \|Y-X\beta\|_2 }[/math] | [math]\displaystyle{ \|\beta\|_1 }[/math] |
| Basis pursuit denoising | [math]\displaystyle{ \|Y-X\beta\|_2 }[/math] | [math]\displaystyle{ \lambda\|\beta\|_1 }[/math] |
| Rudin-Osher-Fatemi model (TV) | [math]\displaystyle{ \|Y-X\beta\|_2 }[/math] | [math]\displaystyle{ \lambda\|\nabla\beta\|_1 }[/math] |
| Potts model | [math]\displaystyle{ \|Y-X\beta\|_2 }[/math] | [math]\displaystyle{ \lambda\|\nabla\beta\|_0 }[/math] |
| RLAD | [math]\displaystyle{ \|Y-X\beta\|_1 }[/math] | [math]\displaystyle{ \|\beta\|_1 }[/math] |
| Dantzig Selector | [math]\displaystyle{ \|X^\top (Y-X\beta)\|_\infty }[/math] | [math]\displaystyle{ \|\beta\|_1 }[/math] |
| SLOPE | [math]\displaystyle{ \|Y-X\beta\|_2 }[/math] | [math]\displaystyle{ \sum_{i=1}^p \lambda_i\vert\beta\vert_{(i)} }[/math] |
A linear combination of the LASSO and ridge regression methods is elastic net regularization.
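As an illustration of the contrast drawn above (not from the source): ridge (L₂) has a closed-form solution and only shrinks coefficients, while lasso (L₁) is non-differentiable and so uses a modified gradient method, here ISTA/proximal gradient, which produces exact zeros and thus feature selection. A minimal NumPy sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]          # only 3 informative features
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Ridge (L2): closed form (X'X + alpha*I)^{-1} X'y -- shrinks all coefficients.
alpha = 1.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Lasso (L1): ISTA / proximal gradient, since ||.||_1 is not differentiable.
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

lam = 5.0
step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # 1/L gradient step size
beta_lasso = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ beta_lasso - y)
    beta_lasso = soft_threshold(beta_lasso - step * grad, step * lam)

# L1 zeroes out the irrelevant coefficients; L2 only shrinks them toward 0.
```

The soft-thresholding step is the proximal operator of the L₁ penalty, which is why lasso yields exact zeros while the ridge solution is dense; blending both penalties gives elastic net regularization.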
2007
- (Schraudolph et al., 2007) ⇒ Nicol Schraudolph, Jin Yu and Simon Guenter. (2007). "A Stochastic Quasi-Newton Method for Online Convex Optimization." In: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007).
- QUOTE: It has been argued that stochastic approximation acts as a regularizer (Neuneier and Zimmermann, 1998, p. 397); our results illustrate how the utility of this effect depends on the particular stochastic gradient method used.
1996
- (Neuneier & Zimmermann, 1996) ⇒ Ralph Neuneier, and Hans-Georg Zimmermann. (1996). "How to Train Neural Networks." In: Neural Networks: Tricks of the Trade.