DFFITS


A DFFITS is a regression diagnostic statistic that measures the influence of an observation on its fitted value in a statistical regression.



References

2016

  • (Wikipedia, 2016) ⇒ https://www.wikiwand.com/en/DFFITS Retrieved 2016-07-24
    • DFFITS is a diagnostic meant to show how influential a point is in a statistical regression. It was proposed in 1980. It is defined as the Studentized DFFIT, where the latter is the change in the predicted value for a point, obtained when that point is left out of the regression; Studentization is achieved by dividing by the estimated standard deviation of the fit at that point:
[math]\displaystyle{ \text{DFFITS} = {\widehat{y_i} - \widehat{y_{i(i)}} \over s_{(i)} \sqrt{h_{ii}}} }[/math]
where [math]\displaystyle{ \widehat{y_i} }[/math] and [math]\displaystyle{ \widehat{y_{i(i)}} }[/math] are the prediction for point i with and without point i included in the regression, [math]\displaystyle{ s_{(i)} }[/math] is the standard error estimated without the point in question, and [math]\displaystyle{ h_{ii} }[/math] is the leverage for the point.
DFFITS is very similar to the externally Studentized residual, and is in fact equal to the latter times [math]\displaystyle{ \sqrt{h_{ii}/(1-h_{ii})} }[/math].
As when the errors are Gaussian the externally Studentized residual is distributed as Student's t (with a number of degrees of freedom equal to the number of residual degrees of freedom minus one), DFFITS for a particular point will be distributed according to this same Student's t distribution multiplied by the leverage factor [math]\displaystyle{ \sqrt{h_{ii}/(1-h_{ii})} }[/math] for that particular point. Thus, for low leverage points, DFFITS is expected to be small, whereas as the leverage goes to 1 the distribution of the DFFITS value widens infinitely.
For a perfectly balanced experimental design (such as a factorial design or balanced partial factorial design), the leverage for each point is p/n, the number of parameters divided by the number of points. This means that the DFFITS values will be distributed (in the Gaussian case) as [math]\displaystyle{ \sqrt{p \over n-p} \approx \sqrt{p \over n} }[/math] times a t variate. Therefore, the authors suggest investigating those points with DFFITS greater than [math]\displaystyle{ 2\sqrt{p \over n} }[/math].
Although the raw values resulting from the equations are different, Cook's distance and DFFITS are conceptually identical and there is a closed-form formula to convert one value to the other.
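The following is a minimal illustrative sketch (not from the cited source; the data and variable names are invented) of how DFFITS can be computed for an ordinary least-squares fit directly from the formulas quoted above, via the externally Studentized residuals and the leverages:

```python
# Sketch: DFFITS for an OLS fit, computed from leverages and externally
# Studentized residuals (DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii))).
# The data here are synthetic placeholders; substitute your own X and y.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                 # n observations, p parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat                     # ordinary residuals e_i
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
h = np.diag(H)                               # leverages h_ii

s2 = resid @ resid / (n - p)                 # full-sample residual variance s^2
# leave-one-out variance s_(i)^2, without refitting n separate regressions
s2_i = ((n - p) * s2 - resid**2 / (1 - h)) / (n - p - 1)
t_ext = resid / np.sqrt(s2_i * (1 - h))      # externally Studentized residuals

dffits = t_ext * np.sqrt(h / (1 - h))        # DFFITS values

threshold = 2 * np.sqrt(p / n)               # rule-of-thumb cutoff from the text
flagged = np.flatnonzero(np.abs(dffits) > threshold)
print("suggested cutoff:", threshold)
print("points worth investigating:", flagged)
```

In practice a library such as statsmodels exposes equivalent OLS influence diagnostics (its OLSInfluence class has a dffits attribute) that can serve as a cross-check on a hand-rolled computation like this one.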

2009

[math]\displaystyle{ \text{DFFITS}(i) = \left(x_i \beta - x_i \beta_{(i)}\right) / SE(x_i\beta) }[/math]
where [math]\displaystyle{ \beta_{(i)} }[/math] is the least squares estimator of [math]\displaystyle{ \beta }[/math] without the i-th case and [math]\displaystyle{ SE(x_i\beta) }[/math] is an estimator of the standard error (SE) of the fitted values. DFFITS is the standardized change in the fitted value of a case when it is deleted. Thus it can be considered a measure of influence on individual fitted values.
Another useful measure of influence is Cook's D (Cook and Weisberg, 1982), which, evaluated at the i-th case, is given by
[math]\displaystyle{ D_i = \left(\beta - \beta_{(i)}\right)' X'X \left(\beta - \beta_{(i)}\right)/(ps^2) }[/math]
[math]\displaystyle{ D_i }[/math] is a measure of the change in all of the fitted values when a case is deleted. Even though [math]\displaystyle{ D_i }[/math] is based on a different theoretical consideration, it is closely related to DFFITS.
It is important to mention that these measures are useful for detecting single cases with an unduly high influence. These indexes, however, suffer from the problem of masking; that is, the presence of some cases can disguise or mask the potential influence of other cases.
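Both excerpts note that Cook's D and DFFITS are closely related; one standard form of the conversion is [math]\displaystyle{ D_i = \text{DFFITS}_i^2 \, s_{(i)}^2 / (p s^2) }[/math]. The sketch below (again with invented synthetic data and variable names, not from the cited sources) checks this identity numerically:

```python
# Sketch: numerically verifying the closed-form link between Cook's D and DFFITS,
#   D_i = DFFITS_i^2 * s_(i)^2 / (p * s^2),
# for an OLS fit on synthetic placeholder data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 0.5 + 1.5 * X[:, 1] + rng.normal(scale=0.4, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)          # leverages h_ii
s2 = e @ e / (n - p)                                   # s^2
s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # s_(i)^2

cooks_d = e**2 * h / (p * s2 * (1 - h) ** 2)           # Cook's distance
dffits = (e / np.sqrt(s2_i * (1 - h))) * np.sqrt(h / (1 - h))

# The conversion formula should hold exactly for every observation.
converted = dffits**2 * s2_i / (p * s2)
assert np.allclose(cooks_d, converted)
print("max |D_i - converted DFFITS_i|:", np.max(np.abs(cooks_d - converted)))
```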

1980

  • (Belsley et al., 1980) ⇒ David A. Belsley, Edwin Kuh, and Roy E. Welsch. (1980). "Regression Diagnostics: Identifying Influential Data and Sources of Collinearity." John Wiley & Sons.