Alexander David H, Novembre John, Lange Kenneth
Department of Biomathematics, University of California at Los Angeles, Los Angeles, California 90095, USA.
Genome Res. 2009 Sep;19(9):1655-64. doi: 10.1101/gr.094052.109. Epub 2009 Jul 31.
Population stratification has long been recognized as a confounding factor in genetic association studies. Estimated ancestries, derived from multi-locus genotype data, can be used to perform a statistical correction for population stratification. One popular technique for estimation of ancestry is the model-based approach embodied by the widely applied program structure. Another approach, implemented in the program EIGENSTRAT, relies on Principal Component Analysis rather than model-based estimation and does not directly deliver admixture fractions. EIGENSTRAT has gained in popularity in part owing to its remarkable speed in comparison to structure. We present a new algorithm and a program, ADMIXTURE, for model-based estimation of ancestry in unrelated individuals. ADMIXTURE adopts the likelihood model embedded in structure. However, ADMIXTURE runs considerably faster, solving problems in minutes that take structure hours. In many of our experiments, we have found that ADMIXTURE is almost as fast as EIGENSTRAT. The runtime improvements of ADMIXTURE rely on a fast block relaxation scheme using sequential quadratic programming for block updates, coupled with a novel quasi-Newton acceleration of convergence. Our algorithm also runs faster and with greater accuracy than the implementation of an Expectation-Maximization (EM) algorithm incorporated in the program FRAPPE. Our simulations show that ADMIXTURE's maximum likelihood estimates of the underlying admixture coefficients and ancestral allele frequencies are as accurate as structure's Bayesian estimates. On real-world data sets, ADMIXTURE's estimates are directly comparable to those from structure and EIGENSTRAT. Taken together, our results show that ADMIXTURE's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.
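As a rough illustration of the model-based approach described above, the sketch below writes out the binomial admixture log-likelihood in its usual form, where each genotype count is modeled from an individual's ancestry fractions and the ancestral allele frequencies, and applies one plain EM update of the kind the abstract attributes to FRAPPE. It is not ADMIXTURE's block relaxation scheme with sequential quadratic programming and quasi-Newton acceleration. The names `log_likelihood`, `em_step`, `G`, `Q`, `F`, and `eps` are assumptions made for this example and do not come from any of the programs mentioned.

```python
import math


def log_likelihood(G, Q, F):
    """Binomial admixture log-likelihood (illustrative sketch).

    G[i][j]: copies (0, 1 or 2) of allele 1 carried by individual i at marker j.
    Q[i][k]: fraction of individual i's ancestry from population k (rows sum to 1).
    F[k][j]: frequency of allele 1 at marker j in population k, strictly in (0, 1).
    """
    total = 0.0
    for g_row, q_row in zip(G, Q):
        for j, g in enumerate(g_row):
            # Probability that a random allele of this individual is allele 1.
            p = sum(q * f_row[j] for q, f_row in zip(q_row, F))
            total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def em_step(G, Q, F, eps=1e-6):
    """One plain EM update of (Q, F); eps is an assumed bound keeping F off 0 and 1."""
    n, m, K = len(G), len(G[0]), len(F)
    q_new = [[0.0] * K for _ in range(n)]
    num = [[0.0] * m for _ in range(K)]  # expected allele-1 copies drawn from population k
    den = [[0.0] * m for _ in range(K)]  # expected allele-2 copies drawn from population k
    for i in range(n):
        for j in range(m):
            g = G[i][j]
            p1 = sum(Q[i][k] * F[k][j] for k in range(K))
            p0 = 1.0 - p1  # equals sum_k Q[i][k] * (1 - F[k][j]) because rows of Q sum to 1
            for k in range(K):
                a = Q[i][k] * F[k][j] / p1          # P(allele 1 came from population k)
                b = Q[i][k] * (1.0 - F[k][j]) / p0  # P(allele 2 came from population k)
                num[k][j] += g * a
                den[k][j] += (2 - g) * b
                q_new[i][k] += (g * a + (2 - g) * b) / (2.0 * m)
    f_new = [
        [min(max(num[k][j] / (num[k][j] + den[k][j]), eps), 1.0 - eps) for j in range(m)]
        for k in range(K)
    ]
    return q_new, f_new
```

Starting from any `Q` with positive entries and rows summing to 1, and any `F` inside `[eps, 1 - eps]`, repeated calls to `em_step` should not decrease `log_likelihood`. Convergence of this plain scheme is typically slow, which is the limitation the accelerated algorithm in the abstract is meant to address.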