# Difference between revisions of "Generalized Additive Model (GAM) Fitting Algorithm"

(3 intermediate revisions by 2 users not shown) | |||

Line 2: | Line 2: | ||

* <B>Context:</B> | * <B>Context:</B> | ||

** It can (typically) specify a [[Distribution Function]], a [[Link Function]], and ... | ** It can (typically) specify a [[Distribution Function]], a [[Link Function]], and ... | ||

+ | ** It can be implemented by a [[GAM Training System]]. | ||

* <B>Example(s):</B> | * <B>Example(s):</B> | ||

** one proposed in ([[Hastie & Tibshirani, 1986]]). | ** one proposed in ([[Hastie & Tibshirani, 1986]]). | ||

Line 9: | Line 10: | ||

== References == | == References == | ||

+ | |||

+ | === 2019 === | ||

+ | * (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Generalized_additive_model Retrieved:2019-9-13. | ||

+ | ** In [[statistics]], a '''generalized additive model (GAM)''' is a [[generalized linear model]] in which the linear predictor depends linearly on unknown [[smooth function]]s of some predictor variables, and interest focuses on inference about these smooth functions. <P> GAMs were originally developed by [[Trevor Hastie]] and [[Robert Tibshirani]]<ref name=Hastie1990></ref> to blend properties of [[generalized linear model]]s with [[additive model]]s. <P> The model relates a univariate response variable, ''Y'', to some predictor variables, ''x''<sub>''i''</sub>. An [[exponential family]] distribution is specified for Y (for example [[normal distribution|normal]], [[binomial distribution|binomial]] or [[Poisson distribution|Poisson]] distributions) along with a [[link function]] ''g'' (for example the identity or log functions) relating the expected value of ''Y'' to the predictor variables via a structure such as : <math> g(\operatorname{E}(Y))=\beta_0 + f_1(x_1) + f_2(x_2)+ \cdots + f_m(x_m).\,\! </math> The functions ''f''<sub>''i''</sub> may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by [[Nonparametric regression|non-parametric means]]. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for ''f''<sub>1</sub>(''x''<sub>1</sub>), and then use a factor model for ''f''<sub>2</sub>(''x''<sub>2</sub>). This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor, provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability. | ||

=== 2017 === | === 2017 === | ||

Line 19: | Line 24: | ||

** The original GAM fitting method estimated the smooth components of the model using non-parametric smoothers (for example smoothing splines or local linear regression smoothers) via the [[backfitting algorithm]]<ref name=Hastie1990 />. Backfitting works by iterative smoothing of partial residuals and provides a very general modular estimation method capable of using a wide variety of smoothing methods to estimate the <math> f_j(x_j) </math> terms. A disadvantage of backfitting is that it is difficult to integrate with the estimation of the degree of smoothness of the model terms, so that in practice the user must set these, or select between a modest set of pre-defined smoothing levels. <P> If the <math> f_j(x_j) </math> are represented using [[smoothing spline]]s<ref name=Wahba1990></ref> then the degree of smoothness can be estimated as part of model fitting using generalized cross validation, or by [[REML]] (sometimes known as 'GML') which exploits the duality between spline smoothers and Gaussian random effects <ref name=Gu1991></ref> . This full spline approach carries an <math> O(n^3) </math> computational cost, where <math> n </math> is the number of observations for the response variable, rendering it somewhat impractical for moderately large datasets. More recent methods have addressed this computational cost either by up front reduction of the size of the basis used for smoothing (rank reduction)<ref name=Wood2000></ref> <ref name=Fahrmeier2001></ref> <P> <ref name=kim2004></ref> <P> <ref name=Wood2017></ref> <ref name=Ruppert2003> <P> </ref> or by finding sparse representations of the smooths using [[Markov random field]]s, which are amenable to the use of [[sparse matrix method]]s for computation <ref name=Rue2009></ref> . These more computationally efficient methods use GCV (or AIC or similar) or REML or take a fully Bayesian approach for inference about the degree of smoothness of the model components. Estimating the degree of smoothness via REML can be viewed as an [[empirical Bayes method]]. <P> An alternative approach with particular advantages in high dimensional settings is to use [[boosting (machine learning)]], although this typically requires bootstrapping for uncertainty quantification<ref name=mboost></ref> <P> <ref name=mayr2012></ref> . | ** The original GAM fitting method estimated the smooth components of the model using non-parametric smoothers (for example smoothing splines or local linear regression smoothers) via the [[backfitting algorithm]]<ref name=Hastie1990 />. Backfitting works by iterative smoothing of partial residuals and provides a very general modular estimation method capable of using a wide variety of smoothing methods to estimate the <math> f_j(x_j) </math> terms. A disadvantage of backfitting is that it is difficult to integrate with the estimation of the degree of smoothness of the model terms, so that in practice the user must set these, or select between a modest set of pre-defined smoothing levels. <P> If the <math> f_j(x_j) </math> are represented using [[smoothing spline]]s<ref name=Wahba1990></ref> then the degree of smoothness can be estimated as part of model fitting using generalized cross validation, or by [[REML]] (sometimes known as 'GML') which exploits the duality between spline smoothers and Gaussian random effects <ref name=Gu1991></ref> . This full spline approach carries an <math> O(n^3) </math> computational cost, where <math> n </math> is the number of observations for the response variable, rendering it somewhat impractical for moderately large datasets. More recent methods have addressed this computational cost either by up front reduction of the size of the basis used for smoothing (rank reduction)<ref name=Wood2000></ref> <ref name=Fahrmeier2001></ref> <P> <ref name=kim2004></ref> <P> <ref name=Wood2017></ref> <ref name=Ruppert2003> <P> </ref> or by finding sparse representations of the smooths using [[Markov random field]]s, which are amenable to the use of [[sparse matrix method]]s for computation <ref name=Rue2009></ref> . These more computationally efficient methods use GCV (or AIC or similar) or REML or take a fully Bayesian approach for inference about the degree of smoothness of the model components. Estimating the degree of smoothness via REML can be viewed as an [[empirical Bayes method]]. <P> An alternative approach with particular advantages in high dimensional settings is to use [[boosting (machine learning)]], although this typically requires bootstrapping for uncertainty quantification<ref name=mboost></ref> <P> <ref name=mayr2012></ref> . | ||

<references/> | <references/> | ||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

=== 2009 === | === 2009 === | ||

Line 41: | Line 39: | ||

*** However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density <math>p(y \vert x)</math>. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution p(x). Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. Linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than. | *** However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density <math>p(y \vert x)</math>. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution p(x). Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. Linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than. | ||

*** Conversely, the generative approach converges to the best model for the [[joint distribution]] ''p''(''x'',''y'') but the resulting conditional density is usually a biased classifier unless its ''p</i><sub>θ</sub>(''x'') part is an accurate model for ''p''(''x''). In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that [[minimize]]s the [[negative log-likelihood classification loss]] against the true density p(x, y) [2]. For finite [[sample size]]s, there is a [[bias-variance tradeoff]] and it is less obvious how to choose between generative and discriminative classifiers. | *** Conversely, the generative approach converges to the best model for the [[joint distribution]] ''p''(''x'',''y'') but the resulting conditional density is usually a biased classifier unless its ''p</i><sub>θ</sub>(''x'') part is an accurate model for ''p''(''x''). In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that [[minimize]]s the [[negative log-likelihood classification loss]] against the true density p(x, y) [2]. For finite [[sample size]]s, there is a [[bias-variance tradeoff]] and it is less obvious how to choose between generative and discriminative classifiers. | ||

− | |||

− | |||

− | |||

=== 1990 === | === 1990 === | ||

Line 54: | Line 49: | ||

[[Category:Concept]] | [[Category:Concept]] | ||

__NOTOC__ | __NOTOC__ | ||

− | |||

− | |||

− | |||

− |

## Latest revision as of 23:53, 13 September 2019

A Generalized Additive Model (GAM) Fitting Algorithm is an additive model algorithm that properties of generalized linear models with additive models.

**Context:**- It can (typically) specify a Distribution Function, a Link Function, and ...
- It can be implemented by a GAM Training System.

**Example(s):**- one proposed in (Hastie & Tibshirani, 1986).

**See:**Naive Bayes, Nonparametric Regression, Generalized Linear Model, Smooth Function, Backfitting Algorithm, Smoothing Spline, REML, Markov Random Field, Empirical Bayes Method.

## References

### 2019

- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Generalized_additive_model Retrieved:2019-9-13.
- In statistics, a
**generalized additive model (GAM)**is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.GAMs were originally developed by Trevor Hastie and Robert Tibshirani

^{[1]}to blend properties of generalized linear models with additive models.The model relates a univariate response variable,

*Y*, to some predictor variables,*x*_{i}. An exponential family distribution is specified for Y (for example normal, binomial or Poisson distributions) along with a link function*g*(for example the identity or log functions) relating the expected value of*Y*to the predictor variables via a structure such as : [math] g(\operatorname{E}(Y))=\beta_0 + f_1(x_1) + f_2(x_2)+ \cdots + f_m(x_m).\,\! [/math] The functions*f*_{i}may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for*f*_{1}(*x*_{1}), and then use a factor model for*f*_{2}(*x*_{2}). This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor, provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.

- In statistics, a

### 2017

- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Generalized_additive_model Retrieved:2017-10-17.
- In statistics, a
**generalized additive model (GAM)**is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.GAMs were originally developed by Trevor Hastie and Robert Tibshirani

^{[1]}to blend properties of generalized linear models with additive models.The model relates a univariate response variable,

*Y*, to some predictor variables,*x*_{i}. An exponential family distribution is specified for Y (for example normal, binomial or Poisson distributions) along with a link function*g*(for example the identity or log functions) relating the expected value of*Y*to the predictor variables via a structure such as : [math] g(\operatorname{E}(Y))=\beta_0 + f_1(x_1) + f_2(x_2)+ \cdots + f_m(x_m).\,\! [/math] The functions*f*_{i}may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for*f*_{1}(*x*_{1}), and then use a factor model for*f*_{2}(*x*_{2}). This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor, provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.

- In statistics, a

- ↑
^{1.0}^{1.1}Cite error: Invalid`<ref>`

tag; no text was provided for refs named`Hastie1990`

### 2017b

- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Generalized_additive_model#GAM_fitting_methods Retrieved:2017-10-17.
- The original GAM fitting method estimated the smooth components of the model using non-parametric smoothers (for example smoothing splines or local linear regression smoothers) via the backfitting algorithm
^{[1]}. Backfitting works by iterative smoothing of partial residuals and provides a very general modular estimation method capable of using a wide variety of smoothing methods to estimate the [math] f_j(x_j) [/math] terms. A disadvantage of backfitting is that it is difficult to integrate with the estimation of the degree of smoothness of the model terms, so that in practice the user must set these, or select between a modest set of pre-defined smoothing levels.If the [math] f_j(x_j) [/math] are represented using smoothing splines

^{[2]}then the degree of smoothness can be estimated as part of model fitting using generalized cross validation, or by REML (sometimes known as 'GML') which exploits the duality between spline smoothers and Gaussian random effects^{[3]}. This full spline approach carries an [math] O(n^3) [/math] computational cost, where [math] n [/math] is the number of observations for the response variable, rendering it somewhat impractical for moderately large datasets. More recent methods have addressed this computational cost either by up front reduction of the size of the basis used for smoothing (rank reduction)^{[4]}^{[5]}^{[6]}^{[7]}^{[8]}or by finding sparse representations of the smooths using Markov random fields, which are amenable to the use of sparse matrix methods for computation^{[9]}. These more computationally efficient methods use GCV (or AIC or similar) or REML or take a fully Bayesian approach for inference about the degree of smoothness of the model components. Estimating the degree of smoothness via REML can be viewed as an empirical Bayes method.An alternative approach with particular advantages in high dimensional settings is to use boosting (machine learning), although this typically requires bootstrapping for uncertainty quantification

^{[10]}^{[11]}.

- The original GAM fitting method estimated the smooth components of the model using non-parametric smoothers (for example smoothing splines or local linear regression smoothers) via the backfitting algorithm

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`Hastie1990`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`Wahba1990`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`Gu1991`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`Wood2000`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`Fahrmeier2001`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`kim2004`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`Wood2017`

- ↑
- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`Rue2009`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`mboost`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`mayr2012`

### 2009

- (Hastie et al., 2009) ⇒ [[::Trevor Hastie]], [[::Robert Tibshirani]], and [[::Jerome H. Friedman]]. (2009). “The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition)." Springer-Verlag. ISBN:0387848576
- QUOTE: ... We describe five related techniques: generalized additive models, trees, multivariate adaptive regression splines, the patient rule induction method, and hierarchical mixtures of experts. ...

### 2005

- (Hastie, 2005) ⇒ Trevor Hastie, (2005). http://www-stat.stanford.edu/brochure/part4.html#hastie
- Generalized additive models adapt nonparametric regression technology to provide more flexibility to the usual linear models used in applied areas, such as logistic and log-linear models. [[Principal curves] and surfaces generalize linear principal components by allowing nonlinear coordinate functions.

### 2004

- (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
- QUOTE:
- In supervised classification, inputs [math]x[/math] and their labels [math]y[/math] arise from an unknown joint probability
*p*(*x*;*y*). If we can approximate*p*(*x*,*y*) using a parametric family of models [math]G[/math] = {*p*_{θ}(x*,*y*),*θ*in Θ}, then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called*generative*classification.* - However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density [math]p(y \vert x)[/math]. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution p(x). Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. Linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than.
- Conversely, the generative approach converges to the best model for the joint distribution
*p*(*x*,*y*) but the resulting conditional density is usually a biased classifier unless its*p*_{θ}(x*) part is an accurate model for*p*(*x*). In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density p(x, y) [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.*

- In supervised classification, inputs [math]x[/math] and their labels [math]y[/math] arise from an unknown joint probability

- QUOTE:

### 1990

- (Hastie & Tibshirani, 1990) ⇒ Trevor Hastie, and Robert Tibshirani. (1990). “Generalized Additive Models." CRC Press. ISBN:0412343908

### 1986

- (Hastie & Tibshirani, 1986) ⇒ Trevor Hastie, and Robert Tibshirani. (1986). “Generalized Additive Models (with discussion).” In: Statistical Science, 1(3).