Generalized Linear Model (GLM)

From GM-RKB

A Generalized Linear Model (GLM) is a fixed effects statistical model which allows the dependent variable to have an error distribution other than a normal distribution and which relates the expected value of the dependent variable, through a link function, to a linear combination of predictors.



References

2017a

2017b

  • (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/generalized_linear_model#Intuition Retrieved:2017-2-1.
    • Ordinary linear regression predicts the expected value of a given unknown quantity (the response variable, a random variable) as a linear combination of a set of observed values (predictors). This implies that a constant change in a predictor leads to a constant change in the response variable (i.e. a linear-response model). This is appropriate when the response variable has a normal distribution (intuitively, when a response variable can vary essentially indefinitely in either direction with no fixed "zero value", or more generally for any quantity that only varies by a relatively small amount, e.g. human heights).

      However, these assumptions are inappropriate for some types of response variables. For example, in cases where the response variable is expected to be always positive and varying over a wide range, constant input changes lead to geometrically varying, rather than constantly varying, output changes. As an example, a prediction model that predicts that a 10 degree temperature decrease would lead to 1,000 fewer people visiting the beach is unlikely to generalize well over both small beaches (e.g. those where the expected attendance was 50 at a particular temperature) and large beaches (e.g. those where the expected attendance was 10,000 at a low temperature). The problem with this kind of prediction model is that, because it implies that a temperature drop of 10 degrees would lead to 1,000 fewer people visiting the beach, a beach whose expected attendance was 50 at a higher temperature would now be predicted to have the impossible attendance value of −950. Logically, a more realistic model would instead predict a constant rate of increased beach attendance (e.g. an increase of 10 degrees leads to a doubling in beach attendance, and a drop of 10 degrees leads to a halving in attendance). Such a model is termed an exponential-response model (or log-linear model, since the logarithm of the response is predicted to vary linearly).

      Similarly, a model that predicts a probability of making a yes/no choice (a Bernoulli variable) is even less suitable as a linear-response model, since probabilities are bounded on both ends (they must be between 0 and 1). Imagine, for example, a model that predicts the likelihood of a given person going to the beach as a function of temperature. A reasonable model might predict, for example, that a change in 10 degrees makes a person two times more or less likely to go to the beach. But what does "twice as likely" mean in terms of a probability? It cannot literally mean to double the probability value (e.g. 50% becomes 100%, 75% becomes 150%, etc.). Rather, it is the odds that are doubling: from 2:1 odds, to 4:1 odds, to 8:1 odds, etc. Such a model is a log-odds model.

      Generalized linear models cover all these situations by allowing for response variables that have arbitrary distributions (rather than simply normal distributions), and for an arbitrary function of the response variable (the link function) to vary linearly with the predicted values (rather than assuming that the response itself must vary linearly). For example, the case above of predicted number of beach attendees would typically be modeled with a Poisson distribution and a log link, while the case of predicted probability of beach attendance would typically be modeled with a Bernoulli distribution (or binomial distribution, depending on exactly how the problem is phrased) and a log-odds (or logit) link function.
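The beach examples above can be checked numerically. The following is a minimal sketch, not part of the quoted source: the function names are illustrative, the slope of 100 visitors per degree follows from the "1,000 fewer people per 10 degrees" figure, and the doubling interval of 10 degrees is taken from the example in the text.

```python
import math


def linear_response(base, delta_temp, slope_per_degree=100.0):
    """Linear-response model: attendance changes by a fixed amount per degree."""
    return base + slope_per_degree * delta_temp


def log_linear_response(base, delta_temp, doubling_degrees=10.0):
    """Exponential-response (log link) model: attendance doubles every
    `doubling_degrees` degrees, so log(attendance) is linear in temperature."""
    return base * 2.0 ** (delta_temp / doubling_degrees)


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def double_odds(p):
    """Log-odds model: 'twice as likely' adds log(2) on the logit scale."""
    return inv_logit(logit(p) + math.log(2.0))


drop = -10.0  # a 10 degree temperature drop
linear_small = linear_response(50.0, drop)       # -950.0, an impossible attendance
log_small = log_linear_response(50.0, drop)      # 25.0, attendance halves
log_large = log_linear_response(10_000.0, drop)  # 5000.0, attendance halves
p_after = double_odds(2.0 / 3.0)                 # odds 2:1 -> 4:1, probability ~0.8
```

The linear model yields a negative attendance for the small beach, while the log-linear model halves attendance at both beaches, and doubling the odds keeps the probability strictly below 1.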

2017c

  • (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/generalized_linear_model#Overview Retrieved:2017-2-1.
    • In a generalized linear model (GLM), each outcome Y of the dependent variables is assumed to be generated from a particular distribution in the exponential family, a large range of probability distributions that includes the normal, binomial, Poisson and gamma distributions, among others. The mean, μ, of the distribution depends on the independent variables, X, through:

      [math]\displaystyle{ \operatorname{E}(\mathbf{Y}) = \boldsymbol{\mu} = g^{-1}(\mathbf{X}\boldsymbol{\beta}) }[/math]

      where E(Y) is the expected value of Y; Xβ is the linear predictor, a linear combination of unknown parameters β; and g is the link function.

      In this framework, the variance is typically a function, V, of the mean:

      [math]\displaystyle{ \operatorname{Var}(\mathbf{Y}) = \operatorname{V}( \boldsymbol{\mu} ) = \operatorname{V}(g^{-1}(\mathbf{X}\boldsymbol{\beta})). }[/math]

      It is convenient if V follows from the exponential family distribution, but it may simply be that the variance is a function of the predicted value.

      The unknown parameters, β, are typically estimated with maximum likelihood, maximum quasi-likelihood, or Bayesian techniques.
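As an illustration of maximum likelihood estimation, the sketch below fits a Poisson GLM with a log link and a single predictor using Newton's method (which coincides with Fisher scoring for a canonical link). The source does not prescribe this algorithm; the function name `fit_poisson_log_link` and the toy data are assumptions made for the example, and the predictor values are assumed not to be all equal.

```python
import math


def fit_poisson_log_link(xs, ys, max_iter=50, tol=1e-10):
    """Maximum likelihood fit of E[y] = exp(b0 + b1 * x) under a Poisson model.

    Each Newton step solves the 2x2 system (X' W X) d = X' (y - mu),
    where W = diag(mu) is the Fisher information weight for the log link.
    """
    b0 = math.log(sum(ys) / len(ys))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
        # Entries of the symmetric information matrix [[a, b], [b, c]].
        a = sum(mus)
        b = sum(m * x for x, m in zip(xs, mus))
        c = sum(m * x * x for x, m in zip(xs, mus))
        det = a * c - b * b
        # Newton step d = inverse([[a, b], [b, c]]) @ [g0, g1].
        d0 = (c * g0 - b * g1) / det
        d1 = (a * g1 - b * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Toy data: attendance doubles for every unit of x (e.g. every 10 degrees).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [50.0, 100.0, 200.0, 400.0]
b0_hat, b1_hat = fit_poisson_log_link(xs, ys)
# Because the data follow exp(b0 + b1 * x) exactly, the estimates should
# recover b0 = log(50) and b1 = log(2) up to numerical tolerance.
```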

2016

  • (Zhang et al., 2016) ⇒ XianXing Zhang, Yitong Zhou, Yiming Ma, Bee-Chung Chen, Liang Zhang, and Deepak Agarwal. (2016). “GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction.” In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ISBN:978-1-4503-4232-2 doi:10.1145/2939672.2939684
    • QUOTE: Generalized linear model (GLM) is a widely used class of models for statistical inference and response prediction problems. For instance, in order to recommend relevant content to a user or optimize for revenue, many web companies use logistic regression models to predict the probability of the user's clicking on an item (e.g., ad, news article, job). In scenarios where the data is abundant, having a more fine-grained model at the user or item level would potentially lead to more accurate prediction, as the user's personal preferences on items and the item's specific attraction for users can be better captured. One common approach is to introduce ID-level regression coefficients in addition to the global regression coefficients in a GLM setting, and such models are called generalized linear mixed models (GLMix) in the statistical literature.
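The idea of adding ID-level coefficients to global coefficients can be sketched as follows. This is a simplified illustration, not the model specification from the paper: it assumes per-user coefficients act on the same feature vector as the global coefficients, and the names `click_probability`, `global_coef` and `per_user_coef` are hypothetical.

```python
import math


def sigmoid(z):
    """Inverse logit link: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def click_probability(features, global_coef, per_user_coef, user_id):
    """Logistic regression score with an additive per-user correction.

    Users without their own coefficients fall back to the global model
    (assumption: an all-zero per-user coefficient vector).
    """
    user_coef = per_user_coef.get(user_id, [0.0] * len(features))
    score = dot(features, global_coef) + dot(features, user_coef)
    return sigmoid(score)


# Hypothetical coefficients: one intercept feature and one item feature.
global_coef = [0.0, 1.0]
per_user_coef = {"u1": [0.0, -1.0]}  # user u1 is less attracted by the item feature
features = [1.0, 2.0]

p_unknown = click_probability(features, global_coef, per_user_coef, "u2")
p_u1 = click_probability(features, global_coef, per_user_coef, "u1")
```

Here the unknown user "u2" is scored with the global coefficients only, while the per-user coefficients for "u1" cancel the global effect of the second feature.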

2005

  • (Everitt & Howell, 2005) ⇒ Brian S. Everitt, and D. Howell. (2005). “Encyclopedia of Statistics in Behavioral Science.” Wiley. ISBN:9780470860809
    • http://www.wiley.com/legacy/wileychi/eosbs/pdfs/bsa251.pdf
    • QUOTE: Generalized linear models (GLMs) represent a class of fixed effects regression models for several types of dependent variables (i.e., continuous, dichotomous, counts). McCullagh and Nelder [32] describe these in great detail and indicate that the term ‘generalized linear model’ is due to Nelder and Wedderburn [35] who described how a collection of seemingly disparate statistical techniques could be unified. Common Generalized linear models (GLMs) include linear regression, logistic regression, and Poisson regression.

      There are three specifications in a GLM. First, the linear predictor, denoted as [math]\displaystyle{ \eta_i }[/math], of a GLM is of the form

      [math]\displaystyle{ \eta_i = x^\prime_i\beta, \qquad (1) }[/math]

      where [math]\displaystyle{ x_i }[/math] is the vector of regressors for unit i with fixed effects [math]\displaystyle{ \beta }[/math]. Then, a link function g(·) is specified which converts the expected value [math]\displaystyle{ \mu_i }[/math] of the outcome variable [math]\displaystyle{ Y_i }[/math] (i.e., [math]\displaystyle{ \mu_i = E[Y_i] }[/math]) to the linear predictor [math]\displaystyle{ \eta_i }[/math]:

      [math]\displaystyle{ g(\mu_i) = \eta_i. \qquad (2) }[/math]

      Finally, a specification for the form of the variance in terms of the mean [math]\displaystyle{ \mu_i }[/math] is made. The latter two specifications usually depend on the distribution of the outcome [math]\displaystyle{ Y_i }[/math], which is assumed to fall within the exponential family of distributions.

      Fixed effects models, which assume that all observations are independent of each other, are not appropriate for analysis of several types of correlated data structures, in particular, for clustered and/or longitudinal data (see Clustered Data). In clustered designs, subjects are observed nested within larger units, for example, schools, hospitals, neighborhoods, workplaces, and so on. In longitudinal designs, repeated observations are nested within subjects (see Longitudinal Data Analysis and Repeated Measures Analysis of Variance). These are often referred to as multilevel [16] or hierarchical [41] data (see Linear Multilevel Models), in which the level-1 observations (subjects or repeated observations) are nested within the higher level-2 observations (clusters or subjects). Higher levels are also possible, for example, a three-level design could have repeated observations (level-1) nested within subjects (level-2) who are nested within clusters (level-3).

      For analysis of such multilevel data, random cluster and/or subject effects can be added into the regression model to account for the correlation of the data. The resulting model is a mixed model including the usual fixed effects for the regressors plus the random effects. Mixed models for continuous normal outcomes have been extensively developed since the seminal paper by Laird and Ware [28]. For nonnormal data, there have also been many developments, some of which are described below. Many of these developments fall under the rubric of generalized linear mixed models (GLMMs), which extend GLMs by the inclusion of random effects in the predictor. Agresti et al. [1] describe a variety of social science applications of GLMMs; [12], [33], and [11] are recent texts with a wealth of statistical material on GLMMs.
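The three GLM specifications described in this entry (linear predictor, link function, and variance as a function of the mean) can be written down compactly for the three common models named in the quote. This is a minimal sketch using the standard canonical links; the `Family` class and the helper names are illustrative and do not come from the source.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Family:
    link: Callable[[float], float]          # g: mean -> linear predictor
    inverse_link: Callable[[float], float]  # g^{-1}: linear predictor -> mean
    variance: Callable[[float], float]      # V(mu), up to a dispersion factor


FAMILIES = {
    # Linear regression: identity link, constant variance function.
    "normal": Family(lambda mu: mu, lambda eta: eta, lambda mu: 1.0),
    # Logistic regression: logit link, V(mu) = mu * (1 - mu).
    "bernoulli": Family(
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
        lambda mu: mu * (1.0 - mu),
    ),
    # Poisson regression: log link, V(mu) = mu.
    "poisson": Family(math.log, math.exp, lambda mu: mu),
}


def linear_predictor(x: Sequence[float], beta: Sequence[float]) -> float:
    """Equation (1): eta_i = x_i' beta."""
    return sum(xi * bi for xi, bi in zip(x, beta))


def mean_response(family: Family, x: Sequence[float], beta: Sequence[float]) -> float:
    """Equation (2) solved for the mean: mu_i = g^{-1}(eta_i)."""
    return family.inverse_link(linear_predictor(x, beta))


# Example: a Poisson model with intercept log(50) and slope log(2).
mu_poisson = mean_response(FAMILIES["poisson"], [1.0, 1.0], [math.log(50.0), math.log(2.0)])
# Example: a Bernoulli model whose linear predictor is zero.
mu_bernoulli = mean_response(FAMILIES["bernoulli"], [1.0, 1.0], [0.5, -0.5])
```

With these coefficients the Poisson mean is approximately 100 and the Bernoulli mean is 0.5, with variance function values equal to the mean and to 0.25 respectively.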