Ambrosius Walter T, Lange Ethan M, Langefeld Carl D
Section on Biostatistics, Department of Public Health Sciences, Wake Forest University School of Medicine, Winston-Salem, NC, USA.
Am J Hum Genet. 2004 Apr;74(4):683-93. doi: 10.1086/383282. Epub 2004 Mar 12.
One of the first and most important steps in planning a genetic association study is the accurate estimation of the statistical power under a proposed study design and sample size. In association studies for candidate genes or in fine-mapping applications, allele and genotype frequencies are often assumed to be known when, in fact, they are unknown (i.e., random variables from some distribution). For example, if we consider a diallelic marker with allele frequencies of 0.5 and 0.5 and Hardy-Weinberg proportions, the three genotype frequencies are often assumed to be 0.25, 0.50, and 0.25, and the statistical power is calculated. Unfortunately, ignoring this source of variation can inflate the estimated power of the study. In the present article, we propose averaging the estimates of power over the distribution of the genotype frequencies to calculate the true estimate of power for a fixed allele frequency. For the usual situation, in which allele frequencies in a population are not known, we propose placing a prior distribution on the allele frequency, taking advantage of any available genotype information. This Bayesian approach provides a more accurate estimate of power. We present examples for quantitative and qualitative traits in cohort studies of unrelated individuals and results from an extensive series of examples that show that ignoring the uncertainty in allele frequencies can inflate the estimated power of the study. We also present the results from case-control studies and show that standard methods may also overestimate power. As discussed in this article, the approach of fixing allele frequencies even if they are not known is the common approach to power calculations. We show that ignoring the sources of variation in allele frequencies tends to result in overestimates of power and, consequently, in studies that are underpowered. Software in C is available at http://www.ambrosius.net/Power/.
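The averaging idea described above can be illustrated with a minimal sketch. It is not the authors' C software. It assumes a quantitative trait in a cohort of unrelated individuals, an additive per-allele effect expressed in residual standard deviation units, Hardy-Weinberg proportions, a two-sided normal approximation to the regression test, and a Beta prior on the allele frequency evaluated on a midpoint grid. All function and parameter names are hypothetical.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def conditional_power(sxx, effect, alpha=0.05):
    """Approximate two-sided power given the genotype sum of squares sxx.

    effect is the per-allele slope divided by the residual SD (an assumption
    of this sketch). A monomorphic sample (sxx == 0) cannot be tested, so it
    is counted as no rejection.
    """
    if sxx <= 0:
        return 0.0
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    t = abs(effect) * sqrt(sxx)
    return _STD_NORMAL.cdf(t - z) + _STD_NORMAL.cdf(-t - z)


def fixed_frequency_power(n, p, effect, alpha=0.05):
    """Usual calculation: genotype frequencies fixed at their HWE expectations."""
    return conditional_power(n * 2 * p * (1 - p), effect, alpha)


def genotype_averaged_power(n, p, effect, alpha=0.05):
    """Average conditional power over multinomial genotype counts (exact sum)."""
    lp0, lp1, lp2 = 2 * log(1 - p), log(2 * p * (1 - p)), 2 * log(p)
    log_n_factorial = lgamma(n + 1)
    total = 0.0
    for n2 in range(n + 1):            # homozygotes for the allele of interest
        for n1 in range(n - n2 + 1):   # heterozygotes
            n0 = n - n1 - n2
            log_pmf = (log_n_factorial - lgamma(n0 + 1) - lgamma(n1 + 1)
                       - lgamma(n2 + 1) + n0 * lp0 + n1 * lp1 + n2 * lp2)
            dosage_sum = n1 + 2 * n2
            # Sum of squared deviations of allele dosage (0, 1, 2) from its mean.
            sxx = n1 + 4 * n2 - dosage_sum ** 2 / n
            total += exp(log_pmf) * conditional_power(sxx, effect, alpha)
    return total


def prior_averaged_power(n, a, b, effect, alpha=0.05, grid=50):
    """Average genotype-averaged power over a Beta(a, b) prior on p.

    Uses a midpoint grid on (0, 1) with weights proportional to the Beta
    density; the grid size is an arbitrary choice for this sketch.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(grid):
        p = (i + 0.5) / grid
        weight = exp((a - 1) * log(p) + (b - 1) * log(1 - p))
        weighted_sum += weight * genotype_averaged_power(n, p, effect, alpha)
        weight_total += weight
    return weighted_sum / weight_total


if __name__ == "__main__":
    n, p, effect = 100, 0.1, 0.7
    print("fixed frequencies:  ", fixed_frequency_power(n, p, effect))
    print("averaged genotypes: ", genotype_averaged_power(n, p, effect))
    # Beta(5, 45) has mean 0.1; it stands in for a prior built from pilot data.
    print("Beta(5, 45) prior:  ", prior_averaged_power(n, 5, 45, effect))
```

In this sketch the exact sum over genotype counts is feasible because a diallelic marker has only three genotypes. For larger samples, or for qualitative traits and case-control designs, a Monte Carlo average over simulated genotype counts would be a natural substitute.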