Department of Biology, University of Central Florida, Orlando, Florida, United States of America.
PLoS One. 2020 Feb 21;15(2):e0229345. doi: 10.1371/journal.pone.0229345. eCollection 2020.
Regressions and meta-regressions are widely used to estimate patterns and effect sizes in various disciplines. However, many biological and medical analyses use relatively small sample sizes (N), contributing to concerns about reproducibility. What is the minimum N needed to identify the most plausible data pattern using regressions? Statistical power analysis is often used to answer that question, but it has its own problems and logically should follow model selection, which first identifies the most plausible model. Here we generate null, simple linear, and quadratic data with different variances and effect sizes. We then sample those data and use information-theoretic model selection to evaluate the minimum N for regression models. We also evaluate the use of the coefficient of determination (R2) for this purpose; it is widely used but not recommended. With very low variance, both false positives and false negatives occurred at N < 8, but data shape was always clearly identified at N ≥ 8. With high variance, accurate inference was stable at N ≥ 25. Those outcomes were consistent across effect sizes. Akaike Information Criterion weights (AICc wi) were essential to clearly identify patterns (e.g., simple linear vs. null); R2 and adjusted R2 values were not useful. We conclude that a minimum N = 8 is informative given very little variance, but N ≥ 25 is required with more variance. Alternative models are better compared using information-theoretic indices such as AIC than using R2 or adjusted R2. Insufficient N and R2-based model selection apparently contribute to confusion and low reproducibility in various disciplines. To avoid those problems, we recommend that research based on regressions or meta-regressions use N ≥ 25.
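As a rough illustration of how AICc weights can compare null, simple linear, and quadratic models, the sketch below fits all three to one simulated sample and converts their AICc scores into weights. The abstract does not give the formulas or code the authors used, so this assumes the common least-squares form of AICc (with the residual variance counted as a parameter). The function names, the simulated slope and noise level, and the choice of N = 25 are all hypothetical and are not taken from the study.

```python
import numpy as np

def aicc_least_squares(y, y_hat, n_coef):
    """AICc for a least-squares fit (assumed form, not from the paper).

    k counts the regression coefficients plus the residual variance.
    Requires n > k + 1, otherwise the correction term is undefined.
    """
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    k = n_coef + 1
    aic = n * np.log(rss / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def akaike_weights(scores):
    """Convert information-criterion scores into weights that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    delta = scores - scores.min()      # difference from the best model
    rel = np.exp(-0.5 * delta)         # relative likelihood of each model
    return rel / rel.sum()

def compare_models(x, y):
    """Fit null, linear, and quadratic polynomials; return AICc weights."""
    names = ["null", "linear", "quadratic"]
    scores = []
    for degree in (0, 1, 2):
        coefs = np.polyfit(x, y, degree)
        y_hat = np.polyval(coefs, x)
        scores.append(aicc_least_squares(y, y_hat, degree + 1))
    return dict(zip(names, akaike_weights(scores)))

# Hypothetical example data: a linear trend with moderate noise, N = 25.
rng = np.random.default_rng(0)
n = 25
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)

weights = compare_models(x, y)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Repeating the comparison over many simulated samples at each N, and recording how often the true data shape receives the highest weight, would approximate the kind of evaluation the abstract describes. Note that the quadratic model here needs N > 5 for the AICc correction to be defined.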