# P-Value

A p-value is a probability measure used in frequentist statistics to quantify the strength of the evidence against a null hypothesis relative to an alternative hypothesis.

**AKA:** p-Value.

**Context:**
- It can be defined as a summary statistic/estimated probability, from a statistical significance test on an observed sample, of obtaining results at least as extreme as those observed, given a correct null hypothesis.
- It can be used to reject a Null Hypothesis, i.e. when the test statistic's probability is less than a predefined significance level, the null hypothesis is rejected.

**Example(s):**
- Assume a coin-toss experiment with a fair-coin null hypothesis, for a coin that you suspect is weighted toward heads. If there are more head events than tail events after x coin tosses, then the p-value is the estimated probability that one would get at least as many head events if the coin were indeed a fair coin.
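The coin-toss example above can be sketched numerically; this is a minimal pure-Python computation of the exact one-sided binomial p-value, with illustrative counts (60 heads in 100 tosses) chosen for this sketch rather than taken from the text:

```python
from math import comb

def one_sided_pvalue(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the probability of seeing at
    least k heads in n tosses if the fair-coin null hypothesis is true."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Illustrative numbers: 60 heads in 100 tosses of a supposedly fair coin.
pv = one_sided_pvalue(100, 60)
print(pv)  # roughly 0.028 -- below the conventional 0.05 significance level
```

With no heads required (`k = 0`) the p-value is 1, since any outcome is "at least as extreme" as that.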

**See:** t-Test, Statistical Score, Statistical Significance, Null Hypothesis, Q Value/False Discovery Rate, Type I Error Rate, Frequentist Inference, Bayesian Inference.

## References

### 2016

- (Stat Trek, 2016) ⇒ http://stattrek.com/statistics/dictionary.aspx?definition=P-value *Retrieved: 2016-10-09*
  - QUOTE: A P-value measures the strength of evidence in support of a null hypothesis. Suppose the test statistic in a hypothesis test is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.

- (Statistical Analysis Glossary, 2016) ⇒ http://www.quality-control-plan.com/StatGuide/sg_glos.htm *Retrieved: 2016-10-09*
  - QUOTE: In a statistical hypothesis test, the P value is the probability of observing a test statistic at least as extreme as the value actually observed, assuming that the null hypothesis is true. This probability is then compared to the pre-selected significance level of the test. If the P value is smaller than the significance level, the null hypothesis is rejected, and the test result is termed significant. The P value depends on both the null hypothesis and the alternative hypothesis. In particular, a test with a one-sided alternative hypothesis will generally have a lower P value (and thus be more likely to be significant) than a test with a two-sided alternative hypothesis. However, one-sided tests require more stringent assumptions than two-sided tests. They should only be used when those assumptions apply.
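The one-sided versus two-sided distinction in the quote above can be illustrated with an exact binomial test. This is a sketch: the counts are made up, and the two-sided value uses the simple doubling convention, which is valid here because the null distribution is symmetric.

```python
from math import comb

def binom_sf(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, heads = 100, 60
one_sided = binom_sf(n, heads)                    # H1: coin biased toward heads
two_sided = min(1.0, 2 * min(binom_sf(n, heads),  # H1: coin biased either way
                             1 - binom_sf(n, heads + 1)))

# The one-sided p-value is half the two-sided one here, so it crosses a
# 0.05 threshold that the two-sided value does not.
print(one_sided < 0.05, two_sided < 0.05)
```

This shows concretely why a one-sided test is "more likely to be significant" and why it should only be used when a deviation in the other direction can be ruled out in advance.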

### 2015

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/p-value *Retrieved: 2015-12-2*
  - QUOTE: In statistics, the **p-value** is a function of the observed sample results (a statistic) that is used for testing a statistical hypothesis. More specifically, the *p*-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, assuming that the hypothesis under consideration is true. Here, "more extreme" is dependent on the way the hypothesis is tested. Before the test is performed, a threshold value is chosen, called the significance level of the test, traditionally 5% or 1% and denoted as *α*. If the *p*-value is equal to or smaller than the significance level (*α*), it suggests that the observed data are inconsistent with the assumption that the null hypothesis is true, and thus that hypothesis must be rejected (but this does not automatically mean the alternative hypothesis can be accepted as true). When the *p*-value is calculated correctly, such a test is guaranteed to control the Type I error rate to be no greater than *α*. Since the *p*-value is used in frequentist inference (and not Bayesian inference), it does not in itself support reasoning about the probabilities of hypotheses but serves only as a tool for deciding whether to reject the null hypothesis. Statistical hypothesis tests making use of *p*-values are commonly used in many fields of science and social sciences, such as economics, psychology, biology, criminal justice and criminology, and sociology. Misuse of this tool continues to be the subject of criticism.

- (Leek & Peng, 2015) ⇒ Jeffrey T. Leek, and Roger D. Peng. (2015). “Statistics: P values are just the tip of the iceberg.” In: Nature, 520(7549).
- QUOTE: There is no statistic more maligned than the P value. Hundreds of papers and blogposts have been written about what some statisticians deride as 'null hypothesis significance testing' (NHST; see, for example, http://go.nature.com/pfvgqe). NHST deems whether the results of a data analysis are important on the basis of whether a summary statistic (such as a P value) has crossed a threshold. Given the discourse, it is no surprise that some hailed as a victory the banning of NHST methods (and all of statistical inference) in the journal Basic and Applied Social Psychology in February.

### 2010

- (Wikipedia, 2010) ⇒ http://en.wikipedia.org/wiki/P-value
  - QUOTE: … The *lower* the p-value, the *less* likely the result, assuming the Null Hypothesis, so the *more* "significant" the result, in the sense of Statistical Significance – one often uses p-values of 0.05 or 0.01, corresponding to a 5% or 1% chance of an outcome that extreme, given the null hypothesis. It should be noted, however, that the idea of *more* or *less* significance is here only being used for illustrative purposes. The result of a test of significance is either "statistically significant" or "not statistically significant"; there are no shades of gray.
    More technically, a p-value of an experiment is a random variable defined over the Sample Space of the experiment such that its distribution under the null hypothesis is uniform on the interval [0,1]. Many p-values can be defined for the same experiment.
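The claim that a correctly computed p-value is uniformly distributed under the null hypothesis can be checked by simulation. This is a sketch with arbitrary simulation sizes; note that because the exact binomial p-value is discrete, its null distribution is only approximately uniform and slightly conservative:

```python
import random
from math import comb

def one_sided_pvalue(n, k):
    """Exact one-sided binomial p-value P(X >= k) under a fair coin."""
    return sum(comb(n, i) * 0.5**n for i in range(k, n + 1))

random.seed(0)
n_tosses, n_experiments = 50, 2000

# Repeat the coin-toss experiment many times with the null actually true.
pvals = [one_sided_pvalue(n_tosses,
                          sum(random.random() < 0.5 for _ in range(n_tosses)))
         for _ in range(n_experiments)]

# Under an exactly uniform distribution, about 5% of p-values would fall at or
# below 0.05; discreteness makes the observed fraction somewhat lower.
frac = sum(p <= 0.05 for p in pvals) / n_experiments
print(frac)
```

The same simulation run with a biased coin would pile the p-values up near zero instead, which is exactly what rejection at a small *α* exploits.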

### 2009

- (Sun & Wu, 2009) ⇒ Yijun Sun, and Dapeng Wu. (2009). “Feature Extraction Through Local Learning.” In: Statistical Analysis and Data Mining, 2(1). doi:10.1002/sam.10028
- QUOTE: … In wrapper methods, a classification algorithm is employed to evaluate the goodness of a selected feature subset, whereas in filter methods criterion functions evaluate feature subsets by their information content, typically interclass distance (e.g., Fisher score) or statistical measures (e.g., p-value of t-test), instead of optimizing the performance of any specific learning algorithm directly.

### 2001

- (Sterne & Smith, 2001) ⇒ Jonathan A C Sterne, and George Davey Smith. (2001). “Sifting the Evidence — What's Wrong with Significance Tests?”. In: BMJ, 322(7280). doi:10.1136/bmj.322.7280.226
- QUOTE: P values, or significance levels, measure the strength of the evidence against the null hypothesis; the smaller the P value, the stronger the evidence against the null hypothesis.
    An arbitrary division of results, into “significant” or “non-significant” according to the P value, was not the intention of the founders of statistical inference.
    A P value of 0.05 need not provide strong evidence against the null hypothesis, but it is reasonable to say that P<0.001 does. In the results sections of papers the precise P value should be presented, without reference to arbitrary thresholds.
    Results of medical research should not be reported as “significant” or “non-significant” but should be interpreted in the context of the type of study and other available evidence. Bias or confounding should always be considered for findings with low P values.

### 1999

- (Goodman, 1999) ⇒ Steven N. Goodman. (1999). “Toward Evidence-based Medical Statistics. 1: The P Value Fallacy.” In: Annals Internal Medicine, 130(12).
- ABSTRACT: An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain "error rates," without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used -- the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.

### 1995

- (Bland & Altman, 1995) ⇒ J Martin Bland, and Douglas G Altman. (1995). “Multiple Significance Tests: the Bonferroni method.” In: BMJ 1995;310:170
- QUOTE: Many published papers include large numbers of significance tests. These may be difficult to interpret because if we go on testing long enough we will inevitably find something which is "significant." We must beware of attaching too much importance to a lone significant result among a mass of non-significant ones. It may be the one in 20 which we expect by chance alone. ...
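The multiple-testing hazard Bland & Altman describe can be sketched with the Bonferroni correction itself. This is a minimal illustration; the p-values below are made up, and the function name is ours:

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject H0_i only when p_i <= alpha / m, where m is the number of
    tests; this controls the family-wise error rate (the chance of any
    false rejection) at level alpha."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

# Twenty tests: a lone p = 0.03 looks "significant" on its own, but with
# 20 tests roughly one p-value below 0.05 is expected by chance alone.
pvals = [0.03] + [0.5] * 19
print(any(bonferroni_reject(pvals)))  # False: 0.03 > 0.05 / 20 = 0.0025
```

The correction is deliberately conservative: only a much smaller p-value (here, at most 0.0025) would survive it, which is the guard against "the one in 20 which we expect by chance alone."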