Amrhein Valentin, Korner-Nievergelt Fränzi, Roth Tobias
Zoological Institute, University of Basel, Basel, Switzerland.
Research Station Petite Camargue Alsacienne, Saint-Louis, France.
PeerJ. 2017 Jul 7;5:e3544. doi: 10.7717/peerj.3544. eCollection 2017.
The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading P-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small P-values at face value, but mistrust results with larger P-values. In either case, P-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (P ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, P-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, P-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger P-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger P-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that P-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
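The abstract's claim that two studies at 80% power are 'conflicting' in about one third of cases follows from 2 × 0.80 × 0.20 = 0.32. A minimal simulation sketch (not from the paper) is given below; the sample size, effect size, and two-sample t-test setup are illustrative assumptions chosen so that power is roughly 80%.

```python
# Sketch (assumed setup, not the authors' code): pairs of identical studies of a
# true effect, each analyzed with a two-sample t-test at alpha = 0.05.
# With ~80% power, P(one significant, one not) ≈ 2 * 0.8 * 0.2 = 0.32.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n = 64          # per-group sample size (assumed; gives ~80% power for d = 0.5)
d = 0.5         # assumed true standardized effect size
alpha = 0.05
n_pairs = 20_000

def study_is_significant():
    """Simulate one two-group study and test it at the 0.05 threshold."""
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    return stats.ttest_ind(treatment, control).pvalue <= alpha

sig = np.array([[study_is_significant(), study_is_significant()]
                for _ in range(n_pairs)])
conflicting = np.mean(sig[:, 0] != sig[:, 1])

print(f"share of 'conflicting' study pairs: {conflicting:.2f}")  # ≈ 0.32
```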