Coram Marc, Tang Hua
Department of Health Research and Policy, Stanford University, Stanford, California 94305, USA.
Ann Appl Stat. 2007 Dec 12;1(2):459-479. doi: 10.1214/07-aoas121.
Estimation of the allele frequency at genetic markers is a key ingredient in biological and biomedical research, such as studies of human genetic variation or of the genetic etiology of heritable traits. As genetic data become increasingly available, investigators face a dilemma: when should data from other studies and population subgroups be pooled with the primary data? Pooling additional samples will generally reduce the variance of the frequency estimates; however, used inappropriately, pooled estimates can be severely biased due to population stratification. Because of this potential bias, most investigators avoid pooling, even for samples with the same ethnic background and residing on the same continent. Here, we propose an empirical Bayes approach for estimating allele frequencies of single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. In every example we have considered, our estimator achieves a mean squared error (MSE) that is smaller than that of either pooling or not pooling, and it sometimes improves substantially on both extremes. The bias introduced is small, as is shown by a simulation study that is carefully matched to a real data example. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation we are likely to encounter in a genome-wide association study.
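The trade-off between pooling and not pooling can be made concrete with a small calculation. The sketch below is not the empirical Bayes estimator proposed in the paper; it is a generic partial-pooling illustration under assumptions introduced here: allele counts in the primary and related samples are independent binomials, and the related sample is down-weighted by a fixed weight `w` that is assumed known (in practice it would have to be estimated from the data). The function name, the weight `w`, and the numerical values are all hypothetical.

```python
def partial_pool_mse(p1, n1, p2, n2, w):
    """Exact MSE, as an estimator of p1, of (x1 + w*x2) / (n1 + w*n2).

    Assumptions (illustrative only, not taken from the paper):
      x1 ~ Binomial(n1, p1)  allele count in the primary sample
      x2 ~ Binomial(n2, p2)  allele count in a related sample
      x1 and x2 are independent; n1, n2 count alleles, not individuals
      w in [0, 1]: w = 0 ignores the related sample, w = 1 pools fully
    """
    denom = n1 + w * n2
    # Bias comes only from the frequency difference between the two groups.
    bias = w * n2 * (p2 - p1) / denom
    # Variance of a weighted sum of two independent binomial counts.
    var = (n1 * p1 * (1 - p1) + w ** 2 * n2 * p2 * (1 - p2)) / denom ** 2
    return var + bias ** 2


# Hypothetical setting: a small primary sample and a large related sample
# whose true allele frequency differs modestly from the primary one.
p1, n1 = 0.30, 40
p2, n2 = 0.35, 400

for w in (0.0, 0.2, 1.0):
    print(f"w = {w:.1f}  MSE = {partial_pool_mse(p1, n1, p2, n2, w):.6f}")
```

In this hypothetical setting an intermediate weight gives a smaller MSE than either extreme, which is the kind of gain an adaptive estimator aims for. If `p2` is moved far from `p1`, full pooling becomes much worse than using the primary sample alone, which is the stratification bias described above.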