Department of Epidemiology and Biostatistics, Indiana University School of Public Health-Bloomington, Bloomington, IN, USA.
Institute of Industrial Science, The University of Tokyo, Tokyo, Japan.
Int J Obes (Lond). 2020 Jun;44(6):1440-1449. doi: 10.1038/s41366-020-0554-2. Epub 2020 Feb 25.
BACKGROUND/OBJECTIVES: Genetic contributors to obesity are frequently studied in murine models. However, the sample sizes of these studies are often small, and the data may violate assumptions of common statistical tests, such as normality of distributions. We examined whether, in these cases, type I error rates and power are affected by the choice of statistical test.
SUBJECTS/METHODS: We conducted "plasmode"-based simulations using empirical data on body mass (weight) from murine genetic models of obesity. For the type I error simulation, the weight distributions were adjusted to ensure no difference in means between control and mutant groups. For the power simulation, the distributions of the mutant groups were shifted to ensure specific effect sizes. Three to twenty mice were resampled from the empirical distributions to create each plasmode. We then computed type I error rates and power for five common tests on the plasmodes: Student's t test, Welch's t test, Wilcoxon rank sum test (also known as the Mann-Whitney U test), permutation test, and bootstrap test.
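The resampling procedure described above can be illustrated with a minimal sketch. It is not the authors' code. The weight pools are synthetic placeholders rather than the study's data, and the function names, the resampling with replacement, the number of replicates, and the mean-difference variants of the permutation and bootstrap tests are all assumptions made for illustration; the paper's exact procedures may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# PLACEHOLDER pools standing in for empirical body weights (g); not the study's data.
control_pool = rng.normal(25.0, 2.0, size=200)
mutant_pool = rng.lognormal(mean=3.4, sigma=0.2, size=200)  # skewed, larger spread


def permutation_pvalue(x, y, rng, n_perm=500):
    """Two-sided permutation test on the difference in means (assumed variant)."""
    observed = abs(x.mean() - y.mean())
    pooled = np.tile(np.concatenate([x, y]), (n_perm, 1))
    shuffled = rng.permuted(pooled, axis=1)  # reshuffle group labels row by row
    diffs = np.abs(shuffled[:, :x.size].mean(axis=1) - shuffled[:, x.size:].mean(axis=1))
    return (np.sum(diffs >= observed) + 1) / (n_perm + 1)


def bootstrap_pvalue(x, y, rng, n_boot=500):
    """Two-sided bootstrap test on the difference in means (assumed variant).

    Both groups are first centered on the grand mean so that the null
    hypothesis holds, then each group is resampled with replacement.
    """
    observed = abs(x.mean() - y.mean())
    grand = np.concatenate([x, y]).mean()
    x0 = x - x.mean() + grand
    y0 = y - y.mean() + grand
    bx = rng.choice(x0, size=(n_boot, x.size), replace=True).mean(axis=1)
    by = rng.choice(y0, size=(n_boot, y.size), replace=True).mean(axis=1)
    return (np.sum(np.abs(bx - by) >= observed) + 1) / (n_boot + 1)


def rejection_rates(control, mutant, n, shift=0.0, n_sim=200, alpha=0.05, rng=rng):
    """Fraction of plasmodes in which each test rejects at level alpha.

    shift=0 gives the type I error rate (group means are equalized);
    shift>0 moves the mutant distribution and gives power.
    """
    mutant_adj = mutant - mutant.mean() + control.mean() + shift
    hits = dict.fromkeys(["student", "welch", "wilcoxon", "permutation", "bootstrap"], 0)
    for _ in range(n_sim):
        # One plasmode: n mice per group drawn from the adjusted pools.
        x = rng.choice(control, size=n, replace=True)
        y = rng.choice(mutant_adj, size=n, replace=True)
        pvals = {
            "student": stats.ttest_ind(x, y, equal_var=True).pvalue,
            "welch": stats.ttest_ind(x, y, equal_var=False).pvalue,
            "wilcoxon": stats.mannwhitneyu(x, y, alternative="two-sided").pvalue,
            "permutation": permutation_pvalue(x, y, rng),
            "bootstrap": bootstrap_pvalue(x, y, rng),
        }
        for name, p in pvals.items():
            hits[name] += int(p < alpha)
    return {name: count / n_sim for name, count in hits.items()}


# Type I error (no mean difference) and power (mutants shifted by one control SD).
type1 = rejection_rates(control_pool, mutant_pool, n=5)
power = rejection_rates(control_pool, mutant_pool, n=5, shift=control_pool.std(ddof=1))
```

Because the two pools keep their own shapes and variances after the means are equalized, a sketch like this lets the effect of distributional heterogeneity on each test be examined separately from any difference in means.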
RESULTS: We observed type I error inflation for all tests, except the bootstrap test, with small samples (≤5). Type I error inflation decreased as sample size increased (≥8) but did not disappear. The Wilcoxon test should be avoided because of heterogeneity of distributions. For power, a departure from the reference was observed with small samples for all tests. Compared with the other tests, the bootstrap test had less power with small samples.
CONCLUSIONS: Overall, the bootstrap test is recommended for small samples to avoid type I error inflation, but this benefit comes at the cost of lower power. When the sample size is large enough, Welch's t test is recommended because of its high power with minimal type I error inflation.