Noguchi Kimihiro, Konietschke Frank, Marmolejo-Ramos Fernando, Pauly Markus
Department of Mathematics, Western Washington University, Bellingham, WA, 98225, USA.
Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117, Germany.
Behav Res Methods. 2021 Dec;53(6):2712-2724. doi: 10.3758/s13428-021-01595-5. Epub 2021 May 28.
The recent replication crisis has prompted a number of ad hoc suggestions for reducing the chance of false positive findings. Among them, Johnson (Proceedings of the National Academy of Sciences, 110, 19313-19317, 2013) and Benjamin et al. (Nature Human Behaviour, 2, 6-10, 2018) recommend a significance level of α = 0.005 (0.5%) in place of the conventional 0.05 (5%). Although this suggestion is easy to implement, it is unclear whether commonly used statistical tests are robust and/or powerful at such a small significance level. The main aim of our study is therefore to investigate the robustness and power curve behavior of independent (unpaired) two-sample tests for metric and ordinal data at nominal significance levels of α = 0.005 and α = 0.05. An extensive simulation study shows that the permutation versions of the Welch t-test and the Brunner-Munzel test are particularly robust and powerful, whereas commonly used two-sample tests that rely on the t-distribution tend to be either liberal or conservative and show peculiar power curve behavior under skewed distributions with variance heterogeneity.
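To make the idea of a permutation version of the Welch t-test concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a two-sided test, a Monte Carlo approximation with randomly drawn permutations, and the usual (count + 1) / (n_perm + 1) p-value convention. The function names `welch_t` and `permutation_welch_test`, the parameters `n_perm` and `seed`, and the lognormal example data are all hypothetical choices made for illustration.

```python
import numpy as np


def welch_t(x, y):
    """Welch t-statistic: mean difference scaled by the unpooled standard error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Unpooled standard error; assumes each sample has nonzero variance.
    se = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
    return (x.mean() - y.mean()) / se


def permutation_welch_test(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation test based on the Welch t-statistic.

    Returns the observed statistic and an approximate permutation p-value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    n_x = x.size
    t_obs = welch_t(x, y)

    count = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled observations to the two groups.
        perm = rng.permutation(pooled)
        t_perm = welch_t(perm[:n_x], perm[n_x:])
        # Small tolerance guards against floating-point ties.
        if abs(t_perm) >= abs(t_obs) - 1e-12:
            count += 1

    # Adding 1 to numerator and denominator keeps the p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return t_obs, p_value


# Illustrative data (an assumption): skewed samples with unequal variances
# and unequal sample sizes.
rng = np.random.default_rng(42)
sample_a = rng.lognormal(mean=0.0, sigma=0.5, size=15)
sample_b = rng.lognormal(mean=0.0, sigma=1.0, size=25)
t_stat, p = permutation_welch_test(sample_a, sample_b)
print(f"Welch t = {t_stat:.3f}, permutation p = {p:.4f}")
print("reject at alpha = 0.005:", p <= 0.005)
```

With this p-value convention the smallest attainable value is 1 / (n_perm + 1), so a test at α = 0.005 needs at least 199 random permutations before rejection is even possible, and considerably more for a stable estimate.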