Wilson Score Interval

From GM-RKB

A Wilson Score Interval is an approximate Binomial proportion confidence interval that provides a method for calculating a confidence interval for a proportion in a statistical population.



References

2023

  • (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval Retrieved:2023-11-26.
    • The Wilson score interval is an improvement over the normal approximation interval in multiple respects. It was developed by Edwin Bidwell Wilson (1927).[1] Unlike the symmetric normal approximation interval (above), the Wilson score interval is asymmetric. It does not suffer from problems of overshoot and zero-width intervals that afflict the normal interval, and it may be safely employed with small samples and skewed observations.[2] The observed coverage probability is consistently closer to the nominal value, [math]\displaystyle{ 1 - \alpha }[/math] .[3]

      Like the normal interval, the interval can be computed directly from a formula.

      Wilson started with the normal approximation to the binomial: [math]\displaystyle{ z \approx \frac{~\left(\,p - \hat{p}\,\right)~}{\sigma_n} }[/math] with the analytic formula for the sample standard deviation given by

      [math]\displaystyle{ \sigma_n = \sqrt{\,\frac{\,p\left(1-p\right)\,}{n}~}~. }[/math]

      Combining the two, and squaring out the radical, gives an equation that is quadratic in [math]\displaystyle{ p }[/math]: [math]\displaystyle{ \left(\, \hat{p} - p \,\right)^{2} = z^{2}\cdot\frac{\,p\left(1-p\right)\,}{n} }[/math] Transforming the relation into a standard-form quadratic equation for [math]\displaystyle{ p }[/math], treating [math]\displaystyle{ \hat p }[/math] and [math]\displaystyle{ n }[/math] as known values from the sample (see prior section), and using the value of [math]\displaystyle{ z }[/math] that corresponds to the desired confidence for the estimate of [math]\displaystyle{ p }[/math], gives this:

      [math]\displaystyle{ \left( 1 + \frac{\,z^2\,}{n} \right) p^2 + \left( - 2 {\hat p} - \frac{\,z^2\,}{n} \right) p + \biggl( {\hat p}^2 \biggr) = 0 ~, }[/math]

      where all of the values in parentheses are known quantities.
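
The quadratic above can be checked numerically. The following is a minimal sketch, assuming Python and the standard quadratic formula; the function name `wilson_from_quadratic` and the sample values are illustrative and do not come from the quoted source.

```python
import math

def wilson_from_quadratic(p_hat, n, z=1.96):
    """Solve (1 + z^2/n) p^2 + (-2 p_hat - z^2/n) p + p_hat^2 = 0 for p.

    Returns the two roots (smaller, larger), which are the interval limits.
    """
    a = 1.0 + z * z / n
    b = -2.0 * p_hat - z * z / n
    c = p_hat * p_hat
    # The discriminant equals 4 z^2 p_hat (1 - p_hat) / n + z^4 / n^2, so it is
    # never negative in exact arithmetic; max() only guards against rounding.
    disc = max(b * b - 4.0 * a * c, 0.0)
    root = math.sqrt(disc)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)

if __name__ == "__main__":
    lower, upper = wilson_from_quadratic(p_hat=0.3, n=50)
    print(lower, upper)
```

Substituting either root back into the squared relation between the observed and true proportion should reproduce equality up to rounding error, which is a convenient sanity check on the coefficients.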

      The solution for [math]\displaystyle{ p }[/math] estimates the upper and lower limits of the confidence interval for [math]\displaystyle{ p }[/math]. Hence the probability of success [math]\displaystyle{ p }[/math] is estimated by [math]\displaystyle{ p \approx ( w^- , w^+ ) = \frac{1}{~1+\frac{\,z^2\,}{n}~}\left( \hat p+\frac{\,z^2\,}{2n} \right) ~ \pm ~ \frac{z}{~1+\frac{z^2}{n}~}\sqrt{\frac{\,\hat p(1-\hat p)\,}{n}+\frac{\,z^2\,}{4n^2}~} ~ }[/math] or the equivalent [math]\displaystyle{ p \approx \frac{~ n_S + \tfrac{1}{2} z^2 ~}{ n + z^2 } ~ \pm ~ \frac{z}{n + z^2} \sqrt{ \frac{~n_S \, n_F~}{n} + \frac{z^2}{4} ~ }~, }[/math] where [math]\displaystyle{ n_S }[/math] and [math]\displaystyle{ n_F }[/math] are the numbers of successes and failures, with [math]\displaystyle{ n = n_S + n_F }[/math]. The practical observation from using this interval is that it has good properties even for a small number of trials and/or an extreme probability.
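
As an illustration of the two closed forms above, here is a minimal Python sketch. The function names `wilson_interval` and `wilson_interval_counts`, and the default `z=1.96` (roughly 95% coverage), are assumptions made for this example rather than anything prescribed by the quoted source.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval (lower, upper) written in terms of p_hat and n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    return center - half_width, center + half_width

def wilson_interval_counts(n_s, n_f, z=1.96):
    """Equivalent form written in terms of success and failure counts."""
    n = n_s + n_f
    if n <= 0:
        raise ValueError("n_s + n_f must be positive")
    center = (n_s + 0.5 * z * z) / (n + z * z)
    half_width = (z / (n + z * z)) * math.sqrt(n_s * n_f / n + z * z / 4.0)
    return center - half_width, center + half_width

if __name__ == "__main__":
    print(wilson_interval(15, 50))
    print(wilson_interval_counts(15, 35))
    print(wilson_interval(0, 10))  # extreme case: no observed successes
```

With 0 successes in 10 trials the lower limit is 0 up to rounding and the upper limit is z²/(n + z²), roughly 0.278, so the interval stays inside [0, 1] and keeps a nonzero width even in this extreme case.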

      Intuitively, the center value of this interval is the weighted average of [math]\displaystyle{ \hat{p} }[/math] and [math]\displaystyle{ \tfrac{1}{2} }[/math] , with [math]\displaystyle{ \hat{p} }[/math] receiving greater weight as the sample size increases. Formally, the center value corresponds to using a pseudocount of [math]\displaystyle{ \tfrac{1}{2} z^2 }[/math], where [math]\displaystyle{ z }[/math] is the number of standard deviations of the confidence interval: add this number to both the count of successes and of failures to yield the estimate of the ratio. For the common two standard deviations in each direction interval (approximately 95% coverage, which itself is approximately 1.96 standard deviations), this yields the estimate [math]\displaystyle{ (n_S+2)/(n+4) }[/math] , which is known as the "plus four rule".
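
The weighted-average reading of the center can be verified with a few lines of Python. This is a sketch only: the function names are hypothetical, and the weight n / (n + z²) on the observed proportion is obtained by rearranging the center of the count-based formula.

```python
def wilson_center(n_s, n, z):
    """Center of the Wilson interval: add z^2 / 2 to successes and to failures."""
    return (n_s + 0.5 * z * z) / (n + z * z)

def weighted_average_center(n_s, n, z):
    """Same center, written as a weighted average of p_hat and 1/2."""
    weight = n / (n + z * z)  # weight on p_hat; tends to 1 as n grows
    p_hat = n_s / n
    return weight * p_hat + (1.0 - weight) * 0.5

if __name__ == "__main__":
    # With z = 2 the center reduces to the plus four rule (n_s + 2) / (n + 4).
    print(wilson_center(7, 20, 2.0), (7 + 2) / (20 + 4))
    print(weighted_average_center(7, 20, 2.0))
```

For 7 successes in 20 trials and z = 2, both functions give 9/24 = 0.375, pulled from the raw proportion 0.35 toward 1/2.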

      Although the quadratic can be solved explicitly, in most cases Wilson's equations can also be solved numerically using the fixed-point iteration [math]\displaystyle{ p_{k+1}=\hat{p} \pm z\cdot\sqrt{\frac{ p_k \cdot \left( 1 - p_k \right)}{n}} }[/math] with [math]\displaystyle{ p_0 = \hat{p} }[/math] .
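
A minimal sketch of that fixed-point iteration in Python follows. The function name, the `sign` argument used to pick the upper (+1) or lower (-1) limit, the stopping tolerance, the iteration cap, and the clamping of iterates to [0, 1] are all assumptions added for this example.

```python
import math

def wilson_fixed_point(p_hat, n, z=1.96, sign=+1, tol=1e-12, max_iter=1000):
    """Iterate p_{k+1} = p_hat + sign * z * sqrt(p_k (1 - p_k) / n) from p_0 = p_hat."""
    p = p_hat
    for _ in range(max_iter):
        p_next = p_hat + sign * z * math.sqrt(p * (1.0 - p) / n)
        # Clamp so that the next square root stays defined (an added safeguard).
        p_next = min(1.0, max(0.0, p_next))
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("fixed-point iteration did not converge")

if __name__ == "__main__":
    upper = wilson_fixed_point(0.3, 50, sign=+1)
    lower = wilson_fixed_point(0.3, 50, sign=-1)
    print(lower, upper)
```

One caveat of this sketch: when the observed proportion is exactly 0 or 1, the starting value is itself a fixed point of the map, so the iteration stalls there instead of reaching the nonzero-width limit; the closed-form solution is the safer choice in that case.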

      The Wilson interval can also be derived from the single sample z-test or Pearson's chi-squared test with two categories. The resulting interval, [math]\displaystyle{ \left\{ \theta \,\,\bigg|\,\, -z \le \frac{\hat{p} - \theta}{\sqrt{\tfrac{1}{n} \theta(1 - \theta)}} \le z \right\}, }[/math] can then be solved for [math]\displaystyle{ \theta }[/math] to produce the Wilson score interval. The test statistic in the middle of the inequality is a score test statistic.
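
The test-inversion view can be illustrated by brute force: scan candidate values of θ and keep those the score test does not reject. The sketch below assumes a symmetric two-sided test with bounds -z and z and a uniform grid on (0, 1); the function names and the grid size are arbitrary choices for this example, not part of the quoted source.

```python
import math

def score_statistic(p_hat, theta, n):
    """Score statistic: (p_hat - theta) over the standard error evaluated at theta."""
    return (p_hat - theta) / math.sqrt(theta * (1.0 - theta) / n)

def invert_score_test(p_hat, n, z=1.96, grid_size=100001):
    """Approximate the interval as the range of grid values theta with |score| <= z."""
    accepted = []
    for i in range(1, grid_size - 1):  # skip theta = 0 and theta = 1 (zero variance)
        theta = i / (grid_size - 1)
        if abs(score_statistic(p_hat, theta, n)) <= z:
            accepted.append(theta)
    if not accepted:
        raise ValueError("no grid value was accepted; try a finer grid")
    return min(accepted), max(accepted)

if __name__ == "__main__":
    print(invert_score_test(0.3, 50))
```

Up to the grid resolution, the accepted range should match the limits obtained from the closed-form expression, since both describe the same set of θ values.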


2021

  • (O'Neill, 2021) ⇒ Ben O'Neill. (2021). “Mathematical properties and finite-population correction for the Wilson score interval.” In: arXiv preprint arXiv:2109.12464. arXiv.org.
    • NOTE: This paper focuses on the mathematical properties of the Wilson score interval, specifically its application to finite populations. It discusses the generalization of the Wilson score interval, aiming to maintain its core properties while adapting it for broader statistical use.

1998

  • (Newcombe, 1998) ⇒ Robert G. Newcombe. (1998). “Interval estimation for the difference between independent proportions: comparison of eleven methods.” In: Statistics in Medicine. Wiley Online Library.
    • NOTE: This study addresses the construction of interval estimates for differences between independent proportions, highlighting the effectiveness of Wilson score intervals. It provides a comparative analysis of eleven different methods, underscoring the advantages and potential applications of Wilson score intervals in various statistical scenarios.

1927