Regularized Learning Algorithm: Difference between revisions

Latest revision as of 17:07, 1 June 2024

A Regularized Learning Algorithm is a supervised model-based learning algorithm that ...

Context:
- It can be applied by a regularized learning system (that can solve a regularized learning task which optimizes a penalized cost function with a regularization penalty parameter).
Example(s):
Counter-Example(s):
- Decision Tree Pruning Algorithm.
See: Elastic Net Regularization, Loss Function, Cross-Validation (Statistics), Hyperparameter Optimization, Ridge Regression, Regularizer (Norm/P-Norm), Minimum Description Length, Model Evaluation, Statistical Learning Theory, VC Dimension.

References

2017

(Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Regularization_(mathematics) Retrieved:2017-12-19.
- In mathematics, statistics, and computer science, particularly in the fields of machine learning and inverse problems, regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.

2016

(Wikipedia, 2016) ⇒ http://en.wikipedia.org/wiki/Regularization_(mathematics)#Regularization_in_statistics_and_machine_learning Retrieved:2016-1-11.
- In statistics and machine learning, regularization methods are used for model selection, in particular to prevent overfitting by penalizing models with extreme parameter values. The most common variants in machine learning are $L_1$ and $L_2$ regularization, which can be added to learning algorithms that minimize a loss function $E(X, Y)$ by instead minimizing $E(X, Y) + \alpha\|w\|$, where $w$ is the model's weight vector, $\|\cdot\|$ is either the $L_1$ norm or the squared $L_2$ norm, and $\alpha$ is a free parameter that needs to be tuned empirically (typically by cross-validation; see hyperparameter optimization). This method applies to many models. When applied in linear regression, the resulting models are termed ridge regression or lasso, but regularization is also employed in (binary and multiclass) logistic regression, neural nets, support vector machines, conditional random fields and some matrix decomposition methods. $L_2$ regularization may also be called "weight decay", in particular in the setting of neural nets.
  $L_1$ regularization is often preferred because it produces sparse models and thus performs feature selection within the learning algorithm, but since the $L_1$ norm is not differentiable, it may require changes to learning algorithms, in particular gradient-based learners. Bayesian learning methods make use of a prior probability that (usually) gives lower probability to more complex models. Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods of controlling overfitting not involving regularization include cross-validation. Regularization can be used to fine-tune model complexity using an augmented error function with cross-validation. The data sets used in complex models can produce a levelling-off of validation as complexity of the models increases. Training data set errors decrease while the validation data set error remains constant. Regularization introduces a second factor which weights the penalty against more complex models with an increasing variance in the data errors. This gives an increasing penalty as model complexity increases. Examples of applications of different methods of regularization to the linear model are:

Model	Fit measure	Entropy measure^[1]^[2]
AIC/BIC	$\|Y - X\beta\|_2$	$\|\beta\|_0$
Ridge regression	$\|Y - X\beta\|_2$	$\|\beta\|_2$
Lasso^[3]	$\|Y - X\beta\|_2$	$\|\beta\|_1$
Basis pursuit denoising	$\|Y - X\beta\|_2$	$\lambda\|\beta\|_1$
Rudin-Osher-Fatemi model (TV)	$\|Y - X\beta\|_2$	$\lambda\|\nabla\beta\|_1$
Potts model	$\|Y - X\beta\|_2$	$\lambda\|\nabla\beta\|_0$
RLAD^[4]	$\|Y - X\beta\|_1$	$\|\beta\|_1$
Dantzig Selector^[5]	$\|X^\top (Y - X\beta)\|_\infty$	$\|\beta\|_1$
SLOPE^[6]	$\|Y - X\beta\|_2$	$\sum_{i=1}^p \lambda_i |\beta|_{(i)}$

A linear combination of the LASSO and ridge regression methods is elastic net regularization.
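The penalized objectives above can be made concrete with a small sketch. The code below is illustrative only and is not taken from the cited sources: it assumes the fit term is the halved mean squared error $\frac{1}{2n}\|Y - X\beta\|_2^2$, uses the squared $L_2$ norm for ridge and the $L_1$ norm for lasso, and handles the non-differentiable $L_1$ penalty with a proximal (soft-thresholding) step, which is one common way of adapting a gradient-based learner. The function names, step size, iteration count, and toy data are all hypothetical.

```python
def soft_threshold(z, t):
    """Proximal operator of t*|z|: shrink z toward zero by t, clipping at 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_penalized(X, y, alpha, penalty, lr=0.1, steps=500):
    """Minimize (1/(2n))*||y - Xw||_2^2 + alpha*P(w) by (proximal) gradient descent.

    penalty="l2": P(w) = ||w||_2^2 (ridge).
    penalty="l1": P(w) = ||w||_1   (lasso, via soft-thresholding).
    """
    if penalty not in ("l1", "l2"):
        raise ValueError("penalty must be 'l1' or 'l2'")
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # Gradient of the fit term: (1/n) * X^T (Xw - y).
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(residual[i] * X[i][j] for i in range(n)) / n for j in range(d)]
        if penalty == "l2":
            # The squared L2 penalty is smooth, so its gradient 2*alpha*w is simply added.
            w = [w[j] - lr * (grad[j] + 2.0 * alpha * w[j]) for j in range(d)]
        else:
            # The L1 penalty is handled by a proximal step after the gradient step.
            w = [soft_threshold(w[j] - lr * grad[j], lr * alpha) for j in range(d)]
    return w


# Toy design with orthogonal columns: y = 2.0*col0 + 0.1*col1 (assumed data).
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]

w_ridge = fit_penalized(X, y, alpha=0.5, penalty="l2")
w_lasso = fit_penalized(X, y, alpha=0.5, penalty="l1")
print("ridge:", w_ridge)
print("lasso:", w_lasso)
```

On this toy design the ridge weights come out close to [1.0, 0.05] (both coefficients shrunk by the same factor), while the lasso weights come out close to [1.5, 0.0]: the small coefficient is set exactly to zero, which is the sparsity effect described above.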

[1] Bishop, Christopher M. (2007). Pattern recognition and machine learning (Corr. printing ed.). New York: Springer. ISBN 978-0387310732.
[2] Duda, Richard O. (2004). Pattern classification + computer manual : hardcover set (2nd ed.). New York [u.a.]: Wiley. ISBN 978-0471703501.
[3] Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso" (PostScript). Journal of the Royal Statistical Society, Series B 58 (1): 267–288. MR 1379242. http://www-stat.stanford.edu/~tibs/ftp/lasso.ps. Retrieved 2009-03-19.
[4] Template:Cite conference
[5] Candes, Emmanuel; Tao, Terence (2007). "The Dantzig selector: Statistical estimation when p is much larger than n". Annals of Statistics 35 (6): 2313–2351. arXiv:math/0506081. doi:10.1214/009053606000001523. MR 2382644.
[6] Bogdan, Małgorzata; van den Berg, Ewout; Su, Weijie; Candes, Emmanuel J. (2013). "Statistical estimation and testing via the ordered L1 norm". arXiv:1310.1969v2. http://arxiv.org/pdf/1310.1969v2.pdf.

2015

https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization/answer/Justin-Solomon
- QUOTE: ... you can view regularization as a prior on the distribution from which your data is drawn (most famously Gaussian for least-squares), as a way to punish high values in regression coefficients, and so on.
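One common reading of this remark is the maximum a posteriori (MAP) view of $L_2$ regularization. The derivation below is a sketch under stated assumptions that are not in the quoted answer: a linear model $y = Xw + \varepsilon$ with Gaussian noise of variance $\sigma^2$, and an independent zero-mean Gaussian prior of variance $\tau^2$ on each weight.

```latex
\begin{aligned}
p(y \mid X, w) &= \mathcal{N}(y \mid Xw,\ \sigma^2 I), \qquad
p(w) = \mathcal{N}(w \mid 0,\ \tau^2 I), \\
-\log p(w \mid X, y) &= \frac{1}{2\sigma^2}\,\|y - Xw\|_2^2
  + \frac{1}{2\tau^2}\,\|w\|_2^2 + \text{const}, \\
\hat{w}_{\text{MAP}} &= \arg\min_w \ \|y - Xw\|_2^2 + \alpha\,\|w\|_2^2,
  \qquad \alpha = \frac{\sigma^2}{\tau^2}.
\end{aligned}
```

Under these assumptions the MAP estimate coincides with the ridge regression solution, and a tighter prior (smaller $\tau^2$) corresponds to a larger penalty weight $\alpha$.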
Compressibility and K-term approximation http://cnx.org/contents/U4hLPGQD@5/Compressible-signals#uid10
- QUOTE: A signal's compressibility is related to the $\ell_p$ space to which the signal belongs. An infinite sequence $x(n)$ is an element of an $\ell_p$ space for a particular value of $p$ if and only if its $\ell_p$ norm is finite: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p} < \infty$
  The smaller $p$ is, the faster the sequence's values must decay in order to converge so that the norm is bounded. In the limiting case of $p = 0$, the "norm" is actually a pseudo-norm and counts the number of non-zero values. As $p$ decreases, the size of its corresponding $\ell_p$ space also decreases. The accompanying figure shows various $\ell_p$ unit balls (all sequences whose $\ell_p$ norm is 1) in 3 dimensions.
  As the value of $p$ decreases, the size of the corresponding $\ell_p$ space also decreases. This can be seen visually when comparing the size of the spaces of signals, in three dimensions, for which the $\ell_p$ norm is less than or equal to one. The volume of these $\ell_p$ "balls" decreases with $p$.
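The shrinking of the unit balls can be checked numerically for finite vectors. The following is a minimal sketch, not part of the quoted source: `lp_norm` is a hypothetical helper that evaluates the formula above for $p > 0$ and counts non-zero entries for $p = 0$, and the test point is an arbitrary choice.

```python
def lp_norm(x, p):
    """Return the l_p (quasi-)norm of a finite sequence x.

    p > 0 : (sum_i |x_i|^p)^(1/p)
    p == 0: number of non-zero entries (the l_0 pseudo-norm)
    """
    if p < 0:
        raise ValueError("p must be non-negative")
    if p == 0:
        return float(sum(1 for v in x if v != 0))
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [3.0, 0.0, -4.0]
print(lp_norm(x, 2))   # Euclidean length
print(lp_norm(x, 1))   # sum of absolute values
print(lp_norm(x, 0))   # count of non-zero entries

# For a fixed point, the norm grows as p shrinks, so a point can lie inside
# the l_2 unit ball while lying outside the l_1 and l_0.5 unit balls.
point = [0.5, 0.5, 0.5]
for p in (2, 1, 0.5):
    inside = lp_norm(point, p) <= 1.0
    print(f"p={p}: norm={lp_norm(point, p):.3f}, inside unit ball: {inside}")
```

Because the norm of a fixed point never decreases as $p$ decreases, every $\ell_p$ unit ball is contained in the unit ball for any larger $p$, which is consistent with the volume statement in the quote.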

2011

(Zhang, 2011e) ⇒ Xinhua Zhang. (2011). “Regularization.” In: (Sammut & Webb, 2011) p.845
- QUOTE: Regularization plays a key role in many machine learning algorithms. Exactly fitting a model to the training data is generally undesirable, because it will fit the noise in the training examples (overfitting), and is doomed to predict (generalize) poorly on unseen data. In contrast, a simple model that fits the training data well is more likely to capture the regularities in it and generalize well. So a regularizer is introduced to quantify the complexity of a model, and many successful machine learning algorithms fall in the framework of regularized risk minimization:
  - (How well the model fits the training data) (1)
  - + λ ⋅ (complexity/regularization of the model), (2)
- where the positive real number λ controls the tradeoff.
  There is a variety of regularizers, which yield different statistical and computational properties. In general, there is no universally best regularizer, and a regularization approach must be chosen depending on ...
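The tradeoff controlled by λ can be illustrated with a one-parameter example. This is a sketch under assumptions that are not in the quoted entry: the fit term is the sum of squared errors of a model $y \approx w x$, the complexity term is $w^2$, and the data are made up. With these choices the minimizer has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$.

```python
def fit_1d(xs, ys, lam):
    """Minimize sum_i (y_i - w*x_i)^2 + lam * w^2 over the scalar w (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def squared_error(xs, ys, w):
    """Fit term: how well the model y = w*x matches the training data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical training data generated by y = 2*x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 14.0, 100.0):
    w = fit_1d(xs, ys, lam)
    print(f"lambda={lam:6.1f}  w={w:.4f}  fit={squared_error(xs, ys, w):.4f}  complexity={w * w:.4f}")
```

As λ grows, the complexity term $w^2$ shrinks while the fit term grows, which is the tradeoff that the positive real number λ controls.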

2007

(Schraudolph et al., 2007) ⇒ Nicol Schraudolph, Jin Yu, and Simon Guenter. (2007). “A Stochastic Quasi-Newton Method for Online Convex Optimization.” In: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007).
- QUOTE: It has been argued that stochastic approximation acts as a regularizer (Neuneier and Zimmermann, 1998, p. 397); our results illustrate how the utility of this effect depends on the particular stochastic gradient method used.

2004

(Hastie et al., 2004) ⇒ Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. (2004). “The Entire Regularization Path for the Support Vector Machine.” In: The Journal of Machine Learning Research, 5.

1996

(Neuneier & Zimmermann, 1996) ⇒ Ralph Neuneier, and Hans-Georg Zimmermann. (1996). “How to Train Neural Networks.” In: Proceeding Neural Networks: Tricks of the Trade


