Observed p-Value
An Observed p-Value is a calculated probability value that represents the chance of obtaining test results at least as extreme as those actually observed, assuming the null hypothesis is true.
- AKA: P-Value, P Value, p-value, p value, Probability Value, Observed Significance Level, Attained Significance Level, Exact Significance, Calculated p-Value, p-Value Score.
- Context:
- It can typically be produced by a p-Value Measure when applied to test statistics and their sampling distributions.
- It can typically quantify Statistical Evidence against the null hypothesis in hypothesis testing tasks.
- It can typically be calculated as P(T ≥ t | H₀) for one-sided tests or P(|T| ≥ |t| | H₀) for two-sided tests (see the computation sketch after this list).
- It can typically serve as a Continuous Measure of evidence rather than a dichotomous decision tool.
- It can typically be compared to a significance level (α) to determine statistical significance.
- It can typically be the output of a p-Value Measure applied to observed test statistics.
- It can often be misinterpreted as the probability that the null hypothesis is true, which is a common p-value misconception.
- It can often decrease with increasing sample size for a fixed effect size, leading to sample size dependency.
- It can often require multiple testing adjustment when multiple hypothesis tests are performed simultaneously.
- It can often be combined with confidence intervals and effect size measures for comprehensive statistical interpretation.
- It can often be affected by study design, data collection methods, and statistical assumptions.
- It can range from being a Very Small Observed p-Value to being a Large Observed p-Value, depending on its evidence strength.
- It can range from being an Exact Observed p-Value to being an Approximate Observed p-Value, depending on its calculation method.
- It can range from being an Unadjusted Observed p-Value to being an Adjusted Observed p-Value, depending on its multiple comparison correction.
- It can range from being a One-Tailed Observed p-Value to being a Two-Tailed Observed p-Value, depending on its hypothesis directionality.
- It can range from being a Parametric Observed p-Value to being a Non-Parametric Observed p-Value, depending on its distributional assumptions.
- It can be uniformly distributed between 0 and 1 under the null hypothesis when test assumptions are met (see the simulation sketch after this list).
- It can be interpreted within the context of prior probabilities and scientific plausibility.
- It can be reported alongside test statistic values and degrees of freedom for transparency.
- It can be visualized using p-value plots, volcano plots, or Manhattan plots in multiple testing contexts.
- It can serve as input to meta-analysis when combining results across studies.
- It can be transformed using p-value combination methods like Fisher's method or Stouffer's method (also shown in the simulation sketch after this list).
- ...
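As referenced above, the tail-probability calculation can be made concrete with a minimal Python sketch, assuming NumPy and SciPy are available; the two samples are hypothetical, generated for illustration only. It computes a two-sample t statistic, the two-sided observed p-value P(|T| ≥ |t| | H₀) that scipy reports by default, and the one-sided value P(T ≥ t | H₀) from the t distribution's survival function.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for two groups (illustrative data only).
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=1.0, size=30)

# Two-sample t-test of mean_b - mean_a; scipy returns the test
# statistic and the two-sided observed p-value P(|T| >= |t| | H0).
t_stat, p_two_sided = stats.ttest_ind(group_b, group_a)

# One-sided observed p-value P(T >= t | H0) for H1: mean_b > mean_a,
# from the survival function of the t distribution with n1 + n2 - 2
# degrees of freedom (equivalent to passing alternative='greater').
df = len(group_a) + len(group_b) - 2
p_one_sided = stats.t.sf(t_stat, df)

print(f"t = {t_stat:.3f}, two-sided p = {p_two_sided:.4f}, "
      f"one-sided p = {p_one_sided:.4f}")
```

When t > 0 the one-sided p-value is half the two-sided one, consistent with the quoted observation below that one-sided tests generally yield lower p-values.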
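Two further properties from the list can be checked empirically: under a true null hypothesis (with test assumptions met) observed p-values are approximately Uniform(0, 1), and independent p-values can be pooled with Fisher's or Stouffer's method. A minimal simulation sketch, assuming SciPy's stats.combine_pvalues; the simulation sizes are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Simulate many two-sample t-tests in which H0 is true
# (both samples come from the same normal distribution).
p_values = np.array([
    stats.ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
    for _ in range(2000)
])

# Under H0 the p-values should be close to Uniform(0, 1):
# roughly 5% of them should fall below 0.05.
print(f"fraction below 0.05: {np.mean(p_values < 0.05):.3f}")

# Pool a handful of independent p-values across hypothetical studies
# using Fisher's method and Stouffer's method.
study_ps = p_values[:5]
_, fisher_p = stats.combine_pvalues(study_ps, method='fisher')
_, stouffer_p = stats.combine_pvalues(study_ps, method='stouffer')
print(f"Fisher combined p = {fisher_p:.4f}, "
      f"Stouffer combined p = {stouffer_p:.4f}")
```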
- Example(s):
- Test-Specific Observed p-Values, such as:
- t-Test Observed p-Value = 0.03, indicating a 3% probability of observing a difference at least as extreme under the null hypothesis.
- Chi-Square Test Observed p-Value = 0.001 showing strong evidence against independence assumption.
- ANOVA Observed p-Value = 0.15 suggesting insufficient evidence against equal means hypothesis.
- Fisher's Exact Test Observed p-Value = 0.045 for categorical association testing.
- Wilcoxon Test Observed p-Value = 0.02 for non-parametric comparison.
- Kolmogorov-Smirnov Test Observed p-Value = 0.08 for distribution comparison.
- Research Context Observed p-Values, such as:
- Clinical Trial Observed p-Value = 0.04 for drug efficacy comparison.
- A/B Test Observed p-Value = 0.02 for conversion rate difference.
- Genome-Wide Association Observed p-Value = 5×10⁻⁸, matching the conventional genome-wide significance threshold for genetic variant association.
- Psychology Study Observed p-Value = 0.048 for behavioral intervention effect.
- Economic Analysis Observed p-Value = 0.001 for policy impact assessment.
- Adjusted Observed p-Values (see the adjustment sketch after these example lists), such as:
- Bonferroni-Adjusted Observed p-Value, where each raw p-value is multiplied by the number of comparisons (equivalent to testing raw p-values against α/m = 0.05/20 = 0.0025 for 20 comparisons).
- FDR-Adjusted Observed p-Value = 0.012 maintaining 5% false discovery rate.
- Holm-Bonferroni Adjusted Observed p-Value, where the i-th smallest raw p-value is tested against α/(m − i + 1) in sequential step-down testing.
- Benjamini-Hochberg Observed p-Value for controlling false discovery rate.
- Correlation Test Observed p-Values, such as:
- Pearson Correlation Observed p-Value = 0.005 for linear relationship.
- Spearman Correlation Observed p-Value = 0.03 for monotonic relationship.
- Partial Correlation Observed p-Value = 0.06 controlling for confounders.
- Model Comparison Observed p-Values, such as:
- Likelihood Ratio Test Observed p-Value = 0.02 comparing nested models.
- Deviance Test Observed p-Value = 0.001 for model fit assessment.
- Hosmer-Lemeshow Test Observed p-Value = 0.15 for goodness-of-fit.
- Time Series Observed p-Values, such as:
- Augmented Dickey-Fuller Test Observed p-Value = 0.01 for stationarity.
- Ljung-Box Test Observed p-Value = 0.03 for autocorrelation.
- KPSS Test Observed p-Value = 0.08 for trend stationarity.
- Interpretation Examples:
- p = 0.001: Very strong evidence against null hypothesis.
- p = 0.03: Moderate evidence against null hypothesis at α = 0.05.
- p = 0.15: Insufficient evidence to reject null hypothesis.
- p = 0.50: Data consistent with null hypothesis.
- p = 0.99: Data strongly consistent with null hypothesis.
- ...
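The adjusted-p-value examples above can be reproduced with standard multiple-testing routines. A minimal sketch, assuming the statsmodels package is available; the 20 raw p-values are hypothetical.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from 20 simultaneous tests.
raw_p = [0.001, 0.008, 0.012, 0.020, 0.031, 0.040, 0.049, 0.060,
         0.075, 0.090, 0.110, 0.150, 0.200, 0.260, 0.330, 0.410,
         0.500, 0.620, 0.780, 0.950]

for method in ("bonferroni", "holm", "fdr_bh"):
    # multipletests returns rejection flags and adjusted p-values.
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(f"{method:10s} smallest adjusted p = {adj_p.min():.4f}, "
          f"rejections = {int(reject.sum())}")
```

Under the Bonferroni method each adjusted p-value is min(m·p, 1), equivalent to comparing raw p-values against α/m as in the examples above; fdr_bh implements the Benjamini-Hochberg step-up procedure.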
- Counter-Example(s):
- Significance Level (α), which is the predetermined threshold rather than calculated probability.
- Test Statistic, which is the standardized measure rather than its tail probability.
- Effect Size Measure, which quantifies magnitude rather than statistical evidence.
- Confidence Interval, which provides parameter range rather than hypothesis probability.
- Bayes Factor, which provides evidence ratio rather than tail probability.
- Posterior Probability, which incorporates prior information unlike p-values.
- Likelihood Ratio, which compares model likelihoods rather than null probability.
- Statistical Power, which measures ability to detect effects rather than evidence against null.
- False Discovery Rate, which controls expected proportion of false positives rather than individual test probability.
- p-Value Measure, which is the function that calculates p-values rather than the value itself.
- See: p-Value Measure, Statistical Hypothesis Testing Task, Test Statistic, Null Hypothesis, Statistical Significance Level, Statistical Significance Measure, Type I Error Probability Measure, Null Hypothesis Rejection Decision, Common P-Value Misconception, Fisher's Exact Test, Bonferroni Correction, Multiple Testing Problem, Statistical Power Measure, Effect Size Measure, Confidence Interval, Bayesian Hypothesis Testing, Evidential Measure, Extreme Test Statistic Measure.
References
2019
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/P-value Retrieved: 2019-10-15.
- In statistical hypothesis testing, the p-value or probability value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct. The use of p-values in statistical hypothesis testing is common in many fields of research such as physics, economics, finance, political science, psychology, biology, criminal justice, criminology, and sociology. [1] The misuse of p-values is a controversial topic in metascience. Italicisation, capitalisation and hyphenation of the term varies. For example, AMA style uses "P value", APA style uses "p value", and the American Statistical Association uses "p-value". [2]
2016a
- (Stat Trek, 2016) ⇒ http://stattrek.com/statistics/dictionary.aspx?definition=P-value Retrieved: 2016-10-09
- QUOTE: A P-value measures the strength of evidence in support of a null hypothesis. Suppose the test statistic in a hypothesis test is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.
2016b
- (Statistical Analysis Glossary, 2016) ⇒ http://www.quality-control-plan.com/StatGuide/sg_glos.htm Retrieved: 2016-10-09
- QUOTE: In a statistical hypothesis test, the P value is the probability of observing a test statistic at least as extreme as the value actually observed, assuming that the null hypothesis is true. This probability is then compared to the pre-selected significance level of the test. If the P value is smaller than the significance level, the null hypothesis is rejected, and the test result is termed significant. The P value depends on both the null hypothesis and the alternative hypothesis. In particular, a test with a one-sided alternative hypothesis will generally have a lower P value (and thus be more likely to be significant) than a test with a two-sided alternative hypothesis. However, one-sided tests require more stringent assumptions than two-sided tests. They should only be used when those assumptions apply.
2015
- (Leek & Peng, 2015) ⇒ Jeffrey T. Leek, and Roger D. Peng. (2015). “Statistics: P values are just the tip of the iceberg.” In: Nature, 520(7549).
- QUOTE: There is no statistic more maligned than the P value. Hundreds of papers and blogposts have been written about what some statisticians deride as 'null hypothesis significance testing' (NHST; see, for example, http://go.nature.com/pfvgqe). NHST deems whether the results of a data analysis are important on the basis of whether a summary statistic (such as a P value) has crossed a threshold. Given the discourse, it is no surprise that some hailed as a victory the banning of NHST methods (and all of statistical inference) in the journal Basic and Applied Social Psychology in February.
2010
- http://en.wikipedia.org/wiki/P-value
- QUOTE: … The lower the p-value, the less likely the result, assuming the Null Hypothesis, so the more "significant" the result, in the sense of Statistical Significance – one often uses p-values of 0.05 or 0.01, corresponding to a 5% or 1% chance of an outcome that extreme, given the null hypothesis. It should be noted, however, that the idea of more or less significance is here only being used for illustrative purposes. The result of a test of significance is either "statistically significant" or "not statistically significant"; there are no shades of gray.
More technically, a p-value of an experiment is a random variable defined over the Sample Space of the experiment such that its distribution under the null hypothesis is uniform on the interval [0,1]. Many p-values can be defined for the same experiment.
2009
- (Sun & Wu, 2009) ⇒ Yijun Sun, and Dapeng Wu. (2009). “Feature Extraction Through Local Learning.” In: Statistical Analysis and Data Mining, 2(1). doi:10.1002/sam.10028
- QUOTE: … In wrapper methods, a classification algorithm is employed to evaluate the goodness of a selected feature subset, whereas in filter methods criterion functions evaluate feature subsets by their information content, typically interclass distance (e.g., Fisher score) or statistical measures (e.g., p-value of t-test), instead of optimizing the performance of any specific learning algorithm directly.
2001
- (Sterne & Smith, 2001) ⇒ Jonathan A C Sterne, and George Davey Smith. (2001). “Sifting the Evidence — What's wrong with significance tests?". In: BMJ, 322(7280). doi:10.1136/bmj.322.7280.226
- QUOTE: P values, or significance levels, measure the strength of the evidence against the null hypothesis; the smaller the P value, the stronger the evidence against the null hypothesis
An arbitrary division of the results, into “significant” or “non-significant” according to the P value, was not the intention of the founders of statistical inference
A P value of 0.05 need not provide strong evidence against the null hypothesis, but it is reasonable to say that P<0.001 does. In the results sections of papers the precise P value should be presented, without reference to arbitrary thresholds
Results of the medical research should not be reported as “significant” or “non-significant” but should be interpreted in the context of the type of the study and other available evidence. Bias or confounding should always be considered for findings with low P values
1999
- (Goodman, 1999) ⇒ Steven N. Goodman. (1999). “Toward Evidence-based Medical Statistics. 1: The P Value Fallacy.” In: Annals Internal Medicine, 130(12).
- ABSTRACT: An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain "error rates," without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used -- the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.
1995
- (Bland & Altman, 1995) ⇒ J Martin Bland, and Douglas G Altman. (1995). “Multiple Significance Tests: the Bonferroni method.” In: BMJ 1995;310:170
- QUOTE: Many published papers include large numbers of significance tests. These may be difficult to interpret because if we go on testing long enough we will inevitably find something which is "significant." We must beware of attaching too much importance to a lone significant result among a mass of non-significant ones. It may be the one in 20 which we expect by chance alone. ...
1925
- (Fisher, 1925) ⇒ Ronald A. Fisher. (1925). “Statistical Methods for Research Workers.” Oliver & Boyd.
- ↑ Babbie, E. (2007). The practice of social research 11th ed. Thomson Wadsworth: Belmont, California.
- ↑ http://magazine.amstat.org/wp-content/uploads/STATTKadmin/style%5B1%5D.pdf