2001 SiftingTheEvidence

(Sterne & Smith, 2001) ⇒ Jonathan A. C. Sterne, and George Davey Smith. (2001). “Sifting the Evidence — What's wrong with significance tests?”. In: BMJ, 322(7280). doi:10.1136/bmj.322.7280.226

Subject Headings:

Notes

It argues that the probability that a statistically significant association is real (rather than a false positive) depends on both the significance level and the proportion of tested hypotheses for which the null hypothesis is true.
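
The dependence described in this note can be sketched numerically. This is a minimal illustration under stated assumptions, not code from the paper: the function name `false_positive_share`, the assumed study power of 0.5, and the assumed 90% share of true null hypotheses are hypothetical inputs chosen for the example.

```python
def false_positive_share(alpha, power, prop_null_true):
    """Share of 'significant' results for which the null hypothesis is true.

    alpha          -- significance threshold (chance of rejecting a true null)
    power          -- chance of rejecting a false null (assumed, hypothetical)
    prop_null_true -- proportion of tested hypotheses whose null is true
    """
    false_positives = alpha * prop_null_true
    true_positives = power * (1.0 - prop_null_true)
    return false_positives / (false_positives + true_positives)


# Hypothetical scenario: 90% of tested nulls are true, studies have 50% power.
for alpha in (0.05, 0.01, 0.001):
    share = false_positive_share(alpha, power=0.5, prop_null_true=0.9)
    print(f"threshold {alpha}: {share:.1%} of significant results are false positives")
```

Under these assumed inputs, close to half of the results that are significant at 0.05 are false positives, against about 2% at 0.001. Changing either the threshold or the proportion of true nulls changes the share, which is the point the note makes.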

Cited By

~450 http://scholar.google.com/scholar?q=%22Sifting+the+Evidence+%E2%80%94+What%27s+wrong+with+significance+tests%3F%22+2001

Quotes

Abstract

The findings of medical research are often met with considerable scepticism, even when they have apparently come from studies with sound methodologies that have been subjected to appropriate statistical analysis. This is perhaps particularly the case with respect to epidemiological findings that suggest that some aspect of everyday life is bad for people. Indeed, one recent popular history, the medical journalist James Le Fanu's The Rise and Fall of Modern Medicine, went so far as to suggest that the solution to medicine's ills would be the closure of all departments of epidemiology. 1

One contributory factor is that the medical literature shows a strong tendency to accentuate the positive; positive outcomes are more likely to be reported than null results. 2 – 4 By this means alone a host of purely chance findings will be published, as by conventional reasoning examining 20 associations will produce one result that is “significant at P=0.05” by chance alone. If only positive findings are published then they may be mistakenly considered to be of importance rather than being the necessary chance results produced by the application of criteria for meaningfulness based on statistical significance. As many studies contain long questionnaires collecting information on hundreds of variables, and measure a wide range of potential outcomes, several false positive findings are virtually guaranteed. The high volume and often contradictory nature 5 of medical research findings, however, is not only because of publication bias. A more fundamental problem is the widespread misunderstanding of the nature of statistical significance.
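
The "one in 20" reasoning in the paragraph above can be checked with a short calculation. This is a sketch only, assuming that all 20 null hypotheses are true and that the tests are independent (neither assumption is stated in the source); the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of 'significant' results when every null is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Chance of at least one 'significant' result among independent tests
    when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests, alpha = 20, 0.05
print(f"expected chance findings: {expected_false_positives(n_tests, alpha):.2f}")
print(f"chance of at least one:   {prob_at_least_one(n_tests, alpha):.2f}")
```

With 20 tests at the 0.05 threshold the expected count is one chance finding, and the probability of seeing at least one is roughly 0.64. With hundreds of variables that probability approaches 1, which is why several false positive findings are described as virtually guaranteed.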

In this paper we consider how the practice of significance testing emerged; an arbitrary division of results as “significant” or “non-significant” (according to the commonly used threshold of P=0.05) was not the intention of the founders of statistical inference. P values need to be much smaller than 0.05 before they can be considered to provide strong evidence against the null hypothesis; this implies that more powerful studies are needed. Reporting of medical research should continue to move from the idea that results are significant or non-significant to the interpretation of findings in the context of the type of study and other available evidence. Editors of medical journals are in an excellent position to encourage such changes, and we conclude with proposed guidelines for reporting and interpretation.

Summary points

P values, or significance levels, measure the strength of the evidence against the null hypothesis; the smaller the P value, the stronger the evidence against the null hypothesis.
An arbitrary division of results, into “significant” or “non-significant” according to the P value, was not the intention of the founders of statistical inference.
A P value of 0.05 need not provide strong evidence against the null hypothesis, but it is reasonable to say that P<0.001 does. In the results sections of papers the precise P value should be presented, without reference to arbitrary thresholds.
Results of medical research should not be reported as “significant” or “non-significant” but should be interpreted in the context of the type of study and other available evidence. Bias or confounding should always be considered for findings with low P values.
To stop the discrediting of medical research by chance findings we need more powerful studies.

References

1. Le Fanu J. “The rise and fall of modern medicine.” New York: Little, Brown, 1999.
2. Berlin JA, Begg, Louis TA. “An assessment of publication bias using a sample of published clinical trials.” J Am Stat Assoc 1989; 84:381–392.
3. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. “Publication bias in clinical research.” Lancet 1991; 337:867–872.
4. Dickersin K, Min YI, Meinert CL. “Factors influencing publication of research results: follow-up of applications submitted to two institutional review boards.” JAMA 1992; 263:374–378.
5. Mayes LC, Horwitz, Feinstein AR. “A collection of 56 topics with contradictory results in case-control research.” Int J Epidemiol 1988; 17:680–685.
6. Goodman. “P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate.” Am J Epidemiol 1993; 137:485–496.
7. Lehmann EL. “The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two?” J Am Stat Assoc 1993; 88:1242–1249.
8. Goodman. “Toward evidence-based medical statistics. 1: The P value fallacy.” Ann Intern Med 1999; 130:995–1004.
9. Fisher RA. “Statistical methods for research workers.” London: Oliver and Boyd, 1950:80.
10. Neyman J, Pearson E. “On the problem of the most efficient tests of statistical hypotheses.” Philos Trans Roy Soc A 1933; 231:289–337.
11. Fisher RA. “Statistical methods and scientific inference.” London: Collins Macmillan, 1973.
12. Feinstein AR. “P-values and confidence intervals: two sides of the same unsatisfactory coin.” J Clin Epidemiol 1998; 51:355–360.
13. Berkson J. “Tests of significance considered as evidence.” J Am Stat Assoc 1942; 37:325–335.
14. Rozeboom WW. “The fallacy of the null-hypothesis significance test.” Psychol Bull 1960; 57:416–428.
15. Freiman JA, Chalmers TC, Smith HJ, Kuebler RR. “The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 “negative” trials.” N Engl J Med 1978; 299:690–694.
16. Cox DR. “Statistical significance tests.” Br J Clin Pharmacol 1982; 14:325–331.
17. Rothman KJ. “Significance questing.” Ann Intern Med 1986; 105:445–447.
18. Altman DG, Gore SM, Gardner MJ, Pocock SJ. “Statistical guidelines for contributors to medical journals.” BMJ 1983; 286:1489–1493.
19. Gardner MJ, Altman DG. “Confidence intervals rather than P values: estimation rather than hypothesis testing.” BMJ 1986; 292:746–750.
20. Gardner MJ, Altman DG. “Statistics with confidence. Confidence intervals and statistical guidelines.” London: BMJ Publishing, 1989.
21. Hopkins PN, Williams RR. “Identification and relative weight of cardiovascular risk factors.” Cardiol Clin 1986; 4:3–31.
22. Freiman JA, Chalmers TC, Smith H, Kuebler RR. “The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial.” In: Bailar JC, Mosteller F, eds. Medical uses of statistics. Boston, MA: NEJM Books, 1992:357–373.
23. Moher D, Dulberg, Wells GA. “Statistical power, sample size, and their reporting in randomized controlled trials.” JAMA 1994; 272:122–124.
24. Mulward S, Gøtzsche PC. “Sample size of randomized double-blind trials 1976-1991.” Dan Med Bull 1996; 43:96–98.
25. Oakes M. “Statistical inference.” Chichester: Wiley, 1986.
26. Browner WS, Newman TB. “Are all significant P values created equal? The analogy between diagnostic tests and clinical research.” JAMA 1987; 257:2459–2463.
27. Edwards W, Lindman H, Savage LJ. “Bayesian statistical inference for psychological research.” Psychol Rev 1963; 70:193–242.
28. Berger JO, Sellke T. “Testing a point null hypothesis: the irreconcilability of P values and evidence.” J Am Stat Assoc 1987; 82:112–122.
29. Lilford RJ, Braunholtz D. “The statistical basis of public policy: a paradigm shift is overdue.” BMJ 1996; 313:603–607.
30. Brophy JM, Joseph L. “Placing trials in context using Bayesian analysis. GUSTO revisited by Reverend Bayes.” JAMA 1995; 273:871–875.
31. Burton PR, Gurrin LC, Campbell MJ. “Clinical significance not statistical significance: a simple Bayesian alternative to P values.” J Epidemiol Community Health 1998; 52:318–323.
32. Goodman. “Toward evidence-based medical statistics. 2: The Bayes factor.” Ann Intern Med 1999; 130:1005–1013.
33. Phillips AN, Davey Smith G. “The design of prospective epidemiological studies: more subjects or better measurements?” J Clin Epidemiol 1993; 46:1203–1211.
34. Yusuf S, Collins R, Peto R. “Why do we need some large, simple randomized trials?” Stat Med 1984; 3:409–422.
35. Egger M, Davey Smith G. “Meta-analysis. Potentials and promise.” BMJ 1997; 315:1371–1374.
36. Danesh J, Whincup P, Walker M, Lennon L, Thomson A, Appleby P, et al. “Chlamydia pneumoniae IgG titres and coronary heart disease: prospective study and meta-analysis.” BMJ 2000; 321:208–213.
37. Morris JN. “Uses of epidemiology.” Edinburgh: Churchill-Livingstone, 1975.
38. Davey Smith G. “Reflections on the limits to epidemiology.” J Clin Epidemiol (in press).
39. Cole P. “The hypothesis generating machine.” Epidemiology 1993; 4:271–273.
40. Davey Smith G, Phillips AN. “Confounding in epidemiological studies: why “independent” effects may not be all they seem.” BMJ 1992; 305:757–759.
