Generalized Additive Model (GAM) Fitting Algorithm

A Generalized Additive Model (GAM) Fitting Algorithm is an additive model algorithm that blends properties of generalized linear models with additive models.

References

2019

• (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Generalized_additive_model Retrieved:2019-9-13.
• In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

GAMs were originally developed by Trevor Hastie and Robert Tibshirani[1] to blend properties of generalized linear models with additive models.

The model relates a univariate response variable, $Y$, to some predictor variables, $x_i$. An exponential family distribution is specified for $Y$ (for example normal, binomial or Poisson distributions) along with a link function $g$ (for example the identity or log functions) relating the expected value of $Y$ to the predictor variables via a structure such as: $g(\operatorname{E}(Y))=\beta_0 + f_1(x_1) + f_2(x_2)+ \cdots + f_m(x_m).$ The functions $f_i$ may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for $f_1(x_1)$, and then use a factor model for $f_2(x_2)$. This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.
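As a rough illustration of the structure $g(\operatorname{E}(Y))=\beta_0 + f_1(x_1) + f_2(x_2)$, the sketch below evaluates the mean of a two-term GAM under a log link. The component functions `f1` and `f2`, the intercept value and the function name `gam_mean` are all hypothetical choices made for this example; in a real fit they would be estimated from data.

```python
import math

def f1(x1):
    # Hypothetical smooth term (stands in for an estimated smoother).
    return 0.5 * math.sin(x1)

def f2(x2):
    # Hypothetical parametric term (a quadratic).
    return 0.1 * x2 ** 2

def gam_mean(x1, x2, beta0=0.2):
    """Return E(Y) when log(E(Y)) = beta0 + f1(x1) + f2(x2)."""
    eta = beta0 + f1(x1) + f2(x2)  # additive predictor on the link scale
    return math.exp(eta)           # inverse of the log link

mu = gam_mean(0.0, 1.0)
```

Because the terms add on the link scale, changing $x_1$ rescales the mean by a factor that does not depend on $x_2$, which is one way to read the additivity assumption.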

2017

• (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Generalized_additive_model Retrieved:2017-10-17.
• In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

GAMs were originally developed by Trevor Hastie and Robert Tibshirani[1] to blend properties of generalized linear models with additive models.

The model relates a univariate response variable, $Y$, to some predictor variables, $x_i$. An exponential family distribution is specified for $Y$ (for example normal, binomial or Poisson distributions) along with a link function $g$ (for example the identity or log functions) relating the expected value of $Y$ to the predictor variables via a structure such as: $g(\operatorname{E}(Y))=\beta_0 + f_1(x_1) + f_2(x_2)+ \cdots + f_m(x_m).$ The functions $f_i$ may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for $f_1(x_1)$, and then use a factor model for $f_2(x_2)$. This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.


2017b

• (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Generalized_additive_model#GAM_fitting_methods Retrieved:2017-10-17.
• The original GAM fitting method estimated the smooth components of the model using non-parametric smoothers (for example smoothing splines or local linear regression smoothers) via the backfitting algorithm[1]. Backfitting works by iterative smoothing of partial residuals and provides a very general modular estimation method capable of using a wide variety of smoothing methods to estimate the $f_j(x_j)$ terms. A disadvantage of backfitting is that it is difficult to integrate with the estimation of the degree of smoothness of the model terms, so that in practice the user must set these, or select between a modest set of pre-defined smoothing levels.
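The iterative smoothing of partial residuals can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: it covers only the identity-link (additive model) case, it uses a fixed-width running mean as the smoother, and the names `running_mean_smoother` and `backfit` are hypothetical. The window width `k` plays the role of the smoothness level that the user has to set by hand.

```python
import math

def running_mean_smoother(x, r, k=5):
    """Smooth values r against x with a running mean over k neighbours in x-order."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    fitted = [0.0] * len(x)
    half = k // 2
    for pos, i in enumerate(order):
        window = order[max(0, pos - half):min(len(x), pos + half + 1)]
        fitted[i] = sum(r[j] for j in window) / len(window)
    return fitted

def backfit(X, y, smoother=running_mean_smoother, n_iter=20):
    """X is a list of predictor columns. Returns (alpha, f) with f[j][i] = f_j(x_ij)."""
    n, m = len(y), len(X)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            # Partial residuals: remove the intercept and every term except f_j.
            partial = [y[i] - alpha - sum(f[k][i] for k in range(m) if k != j)
                       for i in range(n)]
            fj = smoother(X[j], partial)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]  # centre each term for identifiability
    return alpha, f

# Synthetic, noise-free data with an additive structure (illustrative only).
x1 = [i / 10 for i in range(40)]
x2 = [((7 * i) % 40) / 10 for i in range(40)]
y = [math.sin(a) + 0.5 * b for a, b in zip(x1, x2)]
alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(len(y))]
```

Any other smoother with the same `(x, r)` signature could be passed in place of the running mean, which reflects the modularity the passage describes.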

If the $f_j(x_j)$ are represented using smoothing splines[2] then the degree of smoothness can be estimated as part of model fitting using generalized cross validation, or by REML (sometimes known as 'GML') which exploits the duality between spline smoothers and Gaussian random effects[3]. This full spline approach carries an $O(n^3)$ computational cost, where $n$ is the number of observations for the response variable, rendering it somewhat impractical for moderately large datasets. More recent methods have addressed this computational cost either by up front reduction of the size of the basis used for smoothing (rank reduction)[4][5][7][8] or by finding sparse representations of the smooths using Markov random fields, which are amenable to the use of sparse matrix methods for computation[9]. These more computationally efficient methods use GCV (or AIC or similar) or REML or take a fully Bayesian approach for inference about the degree of smoothness of the model components. Estimating the degree of smoothness via REML can be viewed as an empirical Bayes method.
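To make the idea of choosing the degree of smoothness by generalized cross validation concrete, here is a small sketch. To keep it short it swaps in a Gaussian kernel smoother for the spline smoother discussed above. The GCV formula $n \cdot \mathrm{RSS} / (n - \operatorname{tr}(S))^2$ applies to any linear smoother with fitted values $Sy$. The bandwidth grid, the synthetic data and the function names are assumptions made for illustration.

```python
import math
import random

def smoother_matrix(x, h):
    """Row-normalised Gaussian kernel weights, so that fitted = S y."""
    S = []
    for xi in x:
        w = [math.exp(-0.5 * ((xi - xj) / h) ** 2) for xj in x]
        total = sum(w)
        S.append([wj / total for wj in w])
    return S

def gcv_score(x, y, h):
    """GCV = n * RSS / (n - tr(S))^2 for the smoother with bandwidth h."""
    n = len(y)
    S = smoother_matrix(x, h)
    fitted = [sum(S[i][j] * y[j] for j in range(n)) for i in range(n)]
    rss = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    edf = sum(S[i][i] for i in range(n))  # trace of S: effective degrees of freedom
    return n * rss / (n - edf) ** 2

rng = random.Random(0)  # fixed seed so the synthetic data are reproducible
x = [i / 10 for i in range(50)]
y = [math.sin(xi) + rng.gauss(0.0, 0.2) for xi in x]

candidates = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
scores = {h: gcv_score(x, y, h) for h in candidates}
best_h = min(scores, key=scores.get)  # bandwidth with the lowest GCV score
```

Building the full $n \times n$ matrix here is only for clarity; it is the kind of cost that the rank-reduction and sparse approaches are designed to avoid.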

An alternative approach with particular advantages in high dimensional settings is to use boosting, although this typically requires bootstrapping for uncertainty quantification[10][11].
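One common form of boosting for additive models is componentwise L2 boosting, sketched below with simple linear base learners. The passage does not name a specific variant, so this choice, the function name `componentwise_l2_boost`, the step size `nu` and the toy data are all assumptions. At each step only the single best-fitting predictor is updated, which is why the approach tends to suit settings with many candidate predictors.

```python
def componentwise_l2_boost(X, y, n_steps=500, nu=0.1):
    """X is a list of centred predictor columns. Returns (intercept, coefs)."""
    n, m = len(y), len(X)
    intercept = sum(y) / n
    coefs = [0.0] * m
    resid = [yi - intercept for yi in y]
    for _ in range(n_steps):
        best_j, best_b, best_gain = None, 0.0, -1.0
        for j in range(m):
            sxx = sum(v * v for v in X[j])
            sxr = sum(v * r for v, r in zip(X[j], resid))
            b = sxr / sxx    # least-squares slope of the residuals on predictor j
            gain = b * sxr   # drop in RSS if that slope were applied in full
            if gain > best_gain:
                best_j, best_b, best_gain = j, b, gain
        # Take only a small step (shrinkage nu) along the best component.
        coefs[best_j] += nu * best_b
        resid = [r - nu * best_b * v for r, v in zip(resid, X[best_j])]
    return intercept, coefs

def centred(col):
    mean = sum(col) / len(col)
    return [v - mean for v in col]

# Toy data: the response depends on the first two predictors only.
c1 = centred([float(i) for i in range(20)])
c2 = centred([float((-1) ** i) for i in range(20)])
c3 = centred([float((i * i) % 7) for i in range(20)])  # irrelevant predictor
y = [3.0 + 2.0 * a - 1.5 * b for a, b in zip(c1, c2)]
intercept, coefs = componentwise_l2_boost([c1, c2, c3], y)
```

Replacing the linear base learner with a smoother of each predictor would give a boosted additive model. The sketch returns point estimates only, and interval estimates would need something like the bootstrapping mentioned above.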


2004

• (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
• QUOTE:
• In supervised classification, inputs $x$ and their labels $y$ arise from an unknown joint probability $p(x, y)$. If we can approximate $p(x, y)$ using a parametric family of models $G = \{p_\theta(x, y), \theta \in \Theta\}$, then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called generative classification.
• However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density $p(y \vert x)$. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution $p(x)$. Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than their discriminative counterparts.
• Conversely, the generative approach converges to the best model for the joint distribution $p(x, y)$ but the resulting conditional density is usually a biased classifier unless its $p_\theta(x)$ part is an accurate model for $p(x)$. In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density $p(x, y)$ [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.
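The LDA versus linear logistic regression pair mentioned in the quote can be sketched in one dimension. Both functions below return the coefficients $(a, b)$ of a classifier of the form $p(y=1 \vert x) = 1/(1+e^{-(ax+b)})$. The generative fit derives them from estimated class-conditional Gaussians, while the discriminative fit maximises the conditional likelihood directly. The function names, learning rate and toy data are assumptions for illustration only.

```python
import math

def fit_generative(xs, ys):
    """1-D LDA: class means, pooled variance and priors imply p(y=1|x)."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)
    var = (sum((x - m0) ** 2 for x in x0)
           + sum((x - m1) ** 2 for x in x1)) / (len(xs) - 2)
    prior1 = len(x1) / len(xs)
    # With equal variances, Bayes' rule gives a logistic function of x.
    a = (m1 - m0) / var
    b = (m0 ** 2 - m1 ** 2) / (2 * var) + math.log(prior1 / (1 - prior1))
    return a, b

def fit_discriminative(xs, ys, lr=0.1, n_iter=2000):
    """Logistic regression fitted by gradient ascent on the log-likelihood of y given x."""
    a = b = 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (y - p) * x
            grad_b += (y - p)
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

# Two overlapping classes, symmetric about zero (toy data).
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
a_gen, b_gen = fit_generative(xs, ys)
a_dis, b_dis = fit_discriminative(xs, ys)
```

On this symmetric toy set both approaches place the decision boundary at zero. They would be expected to differ when the Gaussian assumption behind the generative model is a poor description of the inputs.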

1986
