Deviance Information Criterion

From GM-RKB

A Deviance Information Criterion (DIC) is a model selection criterion that generalizes the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).



References

2016

Define the deviance as [math]\displaystyle{ D(\theta)=-2 \log(p(y|\theta))+C\, }[/math], where [math]\displaystyle{ y\, }[/math] are the data, [math]\displaystyle{ \theta\, }[/math] are the unknown parameters of the model and [math]\displaystyle{ p(y|\theta)\, }[/math] is the likelihood function. [math]\displaystyle{ C\, }[/math] is a constant that cancels out in all calculations that compare different models, and which therefore does not need to be known.
The expectation [math]\displaystyle{ \bar{D}=\mathbf{E}^\theta[D(\theta)] }[/math], taken over the posterior distribution of [math]\displaystyle{ \theta\, }[/math], is a measure of how well the model fits the data; the larger it is, the worse the fit.
There are two calculations in common usage for the effective number of parameters of the model. The first, as described in Spiegelhalter et al. (2002, p. 587), is [math]\displaystyle{ p_D=\bar{D}-D(\bar{\theta}) }[/math], where [math]\displaystyle{ \bar{\theta} }[/math] is the expectation of [math]\displaystyle{ \theta\, }[/math]. The second, as described in Gelman et al. (2004, p. 182), is [math]\displaystyle{ p_D = p_V = \frac{1}{2}\widehat{\operatorname{var}}\left(D(\theta)\right) }[/math]. The larger the effective number of parameters, the easier it is for the model to fit the data, and so the deviance needs to be penalized.
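The quantities defined so far can be estimated from posterior draws. The following is a minimal sketch, not a reference implementation: it assumes a hypothetical normal model with unknown mean, known unit variance, and a flat prior, so that the posterior can be sampled directly in place of MCMC output. The data, the seed, and the names `deviance`, `theta_draws`, `D_bar`, `p_D`, and `p_V` are illustrative assumptions, and the constant [math]\displaystyle{ C\, }[/math] is set to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations, unknown mean, known standard deviation 1.
y = rng.normal(loc=2.0, scale=1.0, size=50)
n = y.size

# Assumption: flat prior on the mean, so the posterior is N(mean(y), 1/n).
# Sampling it directly stands in for the output of an MCMC run.
theta_draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(n), size=20_000)


def deviance(theta):
    """D(theta) = -2 log p(y | theta), with the constant C taken as 0."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    resid = y[None, :] - theta[:, None]          # shape (draws, n)
    log_lik = -0.5 * n * np.log(2 * np.pi) - 0.5 * (resid ** 2).sum(axis=1)
    return -2.0 * log_lik


D_draws = deviance(theta_draws)

D_bar = D_draws.mean()                           # posterior mean deviance
D_at_mean = deviance(theta_draws.mean())[0]      # deviance at the posterior mean
p_D = D_bar - D_at_mean                          # Spiegelhalter et al. form
p_V = 0.5 * D_draws.var(ddof=1)                  # Gelman et al. form

print(f"D_bar={D_bar:.2f}  p_D={p_D:.2f}  p_V={p_V:.2f}")
```

In this one-parameter setting both estimates of the effective number of parameters should come out close to 1, up to Monte Carlo error.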
The deviance information criterion is calculated as
[math]\displaystyle{ \mathit{DIC} = p_D+\bar{D}, }[/math]
or equivalently as
[math]\displaystyle{ \mathit{DIC} = D(\bar{\theta})+2 p_D. }[/math]
From this latter form, the connection with Akaike's information criterion is evident.
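The equivalence of the two forms follows from substituting [math]\displaystyle{ \bar{D}=D(\bar{\theta})+p_D }[/math] into the first. A short numerical check, using hypothetical deviance values that do not come from any real model, might look like this:

```python
import numpy as np

# Hypothetical deviance values at five posterior draws (illustrative only).
D_draws = np.array([102.0, 98.5, 101.2, 99.7, 100.6])
# Hypothetical deviance evaluated at the posterior mean of the parameters.
D_at_mean = 99.4

D_bar = D_draws.mean()            # posterior mean deviance
p_D = D_bar - D_at_mean           # effective number of parameters

dic_form_1 = p_D + D_bar          # DIC = p_D + D_bar
dic_form_2 = D_at_mean + 2 * p_D  # DIC = D(theta_bar) + 2 p_D

print(dic_form_1, dic_form_2)
```

The second form has the same shape as AIC, which adds twice the number of parameters to minus twice the maximized log-likelihood; here the plug-in deviance and [math]\displaystyle{ p_D\, }[/math] play those roles.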
The idea is that models with smaller DIC should be preferred to models with larger DIC. Models are penalized both by the value of [math]\displaystyle{ \bar{D} }[/math], which favors a good fit, and (in common with AIC and BIC) by the effective number of parameters [math]\displaystyle{ p_D\, }[/math]. Since [math]\displaystyle{ \bar D }[/math] will decrease as the number of parameters in a model increases, the [math]\displaystyle{ p_D\, }[/math] term compensates for this effect by favoring models with a smaller number of parameters. (...) To avoid the over-fitting problems of DIC, Ando (2011) developed Bayesian model selection criteria from a predictive viewpoint.
The criterion is calculated as
[math]\displaystyle{ \mathit{IC} =\bar{D}+2p_D=-2\mathbf{E}^\theta[ \log(p(y|\theta))]+2p_D. }[/math]
The first term is a measure of how well the model fits the data, while the second term is a penalty on the model complexity. Note that the p in this expression is the predictive distribution rather than the likelihood above.
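The rule that smaller values are preferred, and the heavier [math]\displaystyle{ 2p_D }[/math] penalty of the last criterion, can be illustrated with a small model comparison. This is a sketch under stated assumptions: the data are simulated, model A fixes the mean at 0 (no free parameters), model B has a free mean with a flat prior, and the helper names `deviance` and `criteria` are hypothetical. For simplicity the IC line reuses the likelihood-based [math]\displaystyle{ \bar{D} }[/math] and [math]\displaystyle{ p_D\, }[/math] rather than a predictive distribution, so it shows only the arithmetic of the penalty.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 40 observations with true mean 1 and known sd 1.
y = rng.normal(loc=1.0, scale=1.0, size=40)
n = y.size


def deviance(theta):
    """-2 log-likelihood of N(theta, 1) for each value in theta (C = 0)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    resid = y[None, :] - theta[:, None]
    return n * np.log(2 * np.pi) + (resid ** 2).sum(axis=1)


def criteria(theta_draws):
    """Return D_bar, p_D, DIC and IC computed from posterior draws."""
    D_draws = deviance(theta_draws)
    D_bar = D_draws.mean()
    p_D = D_bar - deviance(np.mean(theta_draws))[0]
    return {"D_bar": D_bar, "p_D": p_D,
            "DIC": D_bar + p_D, "IC": D_bar + 2 * p_D}


# Model A: mean fixed at 0, so the "posterior" is a single point and p_D = 0.
model_a = criteria(np.zeros(1))

# Model B: free mean with a flat prior, posterior N(mean(y), 1/n).
model_b = criteria(rng.normal(y.mean(), 1 / np.sqrt(n), size=20_000))

preferred_by_dic = "A" if model_a["DIC"] < model_b["DIC"] else "B"
preferred_by_ic = "A" if model_a["IC"] < model_b["IC"] else "B"

print("DIC prefers", preferred_by_dic, "| IC prefers", preferred_by_ic)
```

With the same inputs the two criteria differ by exactly [math]\displaystyle{ p_D\, }[/math] for each model, so the IC penalizes the extra parameter of model B more strongly than DIC does.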