Department of Sociology, Ludwig-Maximilians-University, Munich, Germany.
PLoS One. 2023 Oct 17;18(10):e0292717. doi: 10.1371/journal.pone.0292717. eCollection 2023.
The validity of scientific findings may be challenged by the replicability crisis (or by cases of fraud), which may not only erode trust within society but also lead to wrong or even harmful policy or medical decisions. The question is: how reliable are scientific results that are reported as statistically significant, and how does this reliability develop over time? Based on 35,515 papers in psychology published between 1975 and 2017 and containing 487,996 test values, this article empirically examines statistical power, publication bias, p-hacking, and the false discovery rate. Assuming constant true effects, statistical power was found to be lower than the recommended 80%, except for large underlying true effects (d = 0.8), and it increased only slightly over time. Publication bias and p-hacking were also found to be substantial. The share of false discoveries among all significant results was estimated at 17.7%, assuming that a proportion θ = 50% of all hypotheses are true and that p-hacking is the only mechanism generating a higher proportion of just-significant results than of just-nonsignificant results. Because the analyses rely on multiple assumptions that cannot be tested, alternative scenarios were laid out. These again yielded the rather optimistic conclusion that, although research results may suffer from low statistical power and publication selection bias, most results reported as statistically significant may reflect substantive findings rather than statistical artifacts.
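To make the quantities in the abstract concrete, the following is a minimal sketch of how statistical power and the share of false discoveries relate in the textbook case. It is not the article's estimation procedure: it assumes a two-sided, two-sample comparison with equal group sizes, a normal approximation to the test statistic, and no publication bias or p-hacking, so it is not expected to reproduce the 17.7% estimate. The function names `approx_power` and `false_discovery_rate`, the significance level of 0.05, and the group size of 26 are illustrative assumptions, not values taken from the article.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation).

    d is the standardized true effect (Cohen's d); n_per_group is the
    (assumed equal) sample size in each group.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected value of the z statistic
    # Probability of landing in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def false_discovery_rate(power, theta, alpha=0.05):
    """Share of significant results that are false positives.

    theta is the proportion of tested hypotheses that are true.
    Assumes no publication bias and no p-hacking (a simplification).
    """
    false_positives = alpha * (1 - theta)
    true_positives = power * theta
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Small, medium, and large effects by Cohen's conventions;
    # the group size of 26 is purely illustrative.
    for d in (0.2, 0.5, 0.8):
        power = approx_power(d, n_per_group=26)
        fdr = false_discovery_rate(power, theta=0.5)
        print(f"d={d}: power={power:.3f}, FDR={fdr:.3f}")
```

Under these simplifying assumptions, lower power raises the share of false discoveries for a fixed θ, which is why the article's power estimates matter for interpreting significant results.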