Center for Quantitative Medicine, University of Connecticut Health Center, Farmington, Connecticut, United States of America.
Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, United States of America.
PLoS One. 2013 Dec 20;8(12):e83079. doi: 10.1371/journal.pone.0083079. eCollection 2013.
In experiments with many statistical tests there is a need to balance type I and type II error rates while taking multiplicity into account. In the traditional approach, the nominal α-level, such as 0.05, is adjusted by the number of tests, L, i.e., to 0.05/L. Assuming that some proportion of tests represent "true signals", that is, originate from a scenario where the null hypothesis is false, power depends on the number of true signals and the respective distribution of effect sizes. One way to define power is as the probability of making at least one correct rejection at the assumed α-level. We advocate an alternative way of establishing how "well-powered" a study is. In our approach, useful for studies with multiple tests, the ranking probability P(k, u) is controlled, defined as the probability of making at least k correct rejections while rejecting the hypotheses with the u smallest P-values. The two approaches are statistically related. The probability that the smallest P-value is a true signal (i.e., P(1, 1)) is equal to the power at the level [Formula: see text], to an excellent approximation. Ranking probabilities are also related to the false discovery rate and to the Bayesian posterior probability of the null hypothesis. We study properties of our approach when the effect size distribution is replaced, for convenience, by a single "typical" value taken to be the mean of the underlying distribution. We conclude that its performance is often satisfactory under this simplification; however, substantial imprecision is to be expected when [Formula: see text] is very large and [Formula: see text] is small. Precision is largely restored when three values with their respective abundances are used instead of a single typical effect size value.
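The ranking probability described above can be illustrated with a short Monte Carlo sketch. The Python code below is not the authors' implementation; the one-sided z-test setup, the function name ranking_probability, and all parameter values (L, m, the effect size, the number of simulations) are illustrative assumptions. It estimates P(k, u) under the single "typical" effect size simplification and, for comparison only, prints single-test power at a Bonferroni-adjusted level alongside the estimated P(1, 1).

```python
"""
Monte Carlo sketch (not the authors' code) of the ranking probability P(k, u):
the probability of making at least k correct rejections when rejecting the
hypotheses with the u smallest P-values.
"""
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def ranking_probability(k, u, L, m, effect, n_sim=20_000):
    """Estimate P(k, u) for L one-sided z-tests, m of which are true signals
    sharing a common standardized effect size (the 'typical value' simplification)."""
    hits = 0
    for _ in range(n_sim):
        # Null statistics ~ N(0, 1); the first m statistics are true signals ~ N(effect, 1).
        z = rng.normal(0.0, 1.0, L)
        z[:m] += effect
        pvals = stats.norm.sf(z)          # one-sided P-values
        top_u = np.argsort(pvals)[:u]     # indices of the u smallest P-values
        correct = np.sum(top_u < m)       # how many of them are true signals
        hits += (correct >= k)
    return hits / n_sim

# Assumed study configuration (illustrative values only).
L, m, effect = 1000, 10, 4.0

# P(1, 1): probability that the smallest P-value is a true signal.
p11 = ranking_probability(k=1, u=1, L=L, m=m, effect=effect)

# Single-test power at the Bonferroni-adjusted level 0.05/L, shown for comparison.
bonf_power = stats.norm.sf(stats.norm.isf(0.05 / L) - effect)
print(f"P(1,1) ~= {p11:.3f};  power at 0.05/{L} = {bonf_power:.3f}")
```

The same routine estimates other ranking probabilities, e.g. ranking_probability(k=5, u=10, ...) for the chance that at least 5 of the 10 top-ranked results are true signals; replacing the single effect value with draws from an assumed effect size distribution would mimic the comparison of the "typical value" simplification against the full distribution discussed in the abstract.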