Jiroutek Michael R, Muller Keith E, Kupper Lawrence L, Stewart Paul W
Bristol-Myers Squibb Pharmaceutical Research Institute, 5 Research Parkway, Wallingford, Connecticut 06492-7660, USA.
Biometrics. 2003 Sep;59(3):580-90. doi: 10.1111/1541-0420.00068.
Scientists often need to test hypotheses and construct corresponding confidence intervals. In designing a study to test a particular null hypothesis, traditional methods lead to a sample size large enough to provide sufficient statistical power. In contrast, traditional methods based on constructing a confidence interval lead to a sample size likely to control the width of the interval. With either approach, a sample size so large as to waste resources or introduce ethical concerns is undesirable. This work was motivated by the concern that existing sample size methods often make it difficult for scientists to achieve their actual goals. We focus on situations which involve a fixed, unknown scalar parameter representing the true state of nature. The width of the confidence interval is defined as the difference between the (random) upper and lower bounds. An event width is said to occur if the observed confidence interval width is less than a fixed constant chosen a priori. An event validity is said to occur if the parameter of interest is contained between the observed upper and lower confidence interval bounds. An event rejection is said to occur if the confidence interval excludes the null value of the parameter. In our opinion, scientists often implicitly seek to have all three occur: width, validity, and rejection. New results illustrate that neglecting rejection or width (and less so validity) often provides a sample size with a low probability of the simultaneous occurrence of all three events. We recommend considering all three events simultaneously when choosing a criterion for determining a sample size. We provide new theoretical results for any scalar (mean) parameter in a general linear model with Gaussian errors and fixed predictors. Convenient computational forms are included, as well as numerical examples to illustrate our methods.
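The three events defined above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' method: the paper gives exact results for a general linear model, whereas this sketch uses Monte Carlo for the simplest special case, a two-sided t interval for a single Gaussian mean. The function name `simulate_events`, all parameter names, and the example values are hypothetical. The critical value `t_crit` is supplied by the caller (2.093 is the tabled 0.975 quantile of the t distribution with 19 degrees of freedom, matching n = 20).

```python
import math
import random


def simulate_events(n, mu, sigma, mu0, delta, t_crit, reps=5000, seed=0):
    """Estimate by Monte Carlo the probabilities of the width, validity,
    and rejection events, and of all three occurring together, for a
    two-sided t confidence interval on a single Gaussian mean.

    n      : sample size
    mu     : true mean (the fixed, unknown parameter)
    sigma  : true standard deviation of the Gaussian errors
    mu0    : null value of the mean
    delta  : fixed constant, chosen a priori, that the width must stay below
    t_crit : t critical value for n - 1 degrees of freedom (caller-supplied)
    """
    rng = random.Random(seed)
    counts = {"width": 0, "validity": 0, "rejection": 0, "all_three": 0}
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # Sample standard deviation with the usual n - 1 divisor.
        sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        half = t_crit * sd / math.sqrt(n)
        lower, upper = mean - half, mean + half

        width = (upper - lower) < delta            # interval is narrow enough
        validity = lower <= mu <= upper            # interval covers the truth
        rejection = not (lower <= mu0 <= upper)    # interval excludes the null

        counts["width"] += width
        counts["validity"] += validity
        counts["rejection"] += rejection
        counts["all_three"] += width and validity and rejection
    return {name: c / reps for name, c in counts.items()}


if __name__ == "__main__":
    # Illustrative values only; they do not come from the paper.
    probs = simulate_events(n=20, mu=1.0, sigma=2.0, mu0=0.0,
                            delta=2.0, t_crit=2.093)
    for name, p in probs.items():
        print(f"{name:>10}: {p:.3f}")
```

Because the joint event requires all three to occur at once, its estimated probability can never exceed the smallest of the three individual probabilities. That is the reason a sample size chosen to control only one of them may leave the simultaneous probability low.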