National Institute on Minority Health and Health Disparities, National Institutes of Health, Bethesda, MD, USA.
IHRC-Georgia Tech Applied Bioinformatics Laboratory, Atlanta, GA, USA.
Nucleic Acids Res. 2023 May 8;51(8):e44. doi: 10.1093/nar/gkad149.
Biobank projects are generating genomic data for many thousands of individuals. Computational methods are needed to handle these massive datasets, including genetic ancestry (GA) inference tools. Current methods for GA inference do not scale to biobank-size genomic datasets. We present Rye, a new algorithm for GA inference at biobank scale. We compared the accuracy and runtime performance of Rye to the widely used RFMix, ADMIXTURE and iAdmix programs and applied it to a dataset of 488,221 genome-wide variant samples from the UK Biobank. Rye infers GA based on principal component analysis of genomic variant samples from ancestral reference populations and query individuals. The algorithm's accuracy is powered by Metropolis-Hastings optimization and its speed is provided by non-negative least squares regression. Rye produces highly accurate GA estimates for three-way admixed populations (African, European and Native American) compared to RFMix and ADMIXTURE ($R^2 = 0.998$ to $1.00$), and shows a 50× runtime improvement compared to ADMIXTURE on the UK Biobank dataset. Rye analysis of UK Biobank samples demonstrates how it can be used to infer GA at both continental and subcontinental levels. We discuss user considerations and options for the use of Rye; the program and its documentation are distributed on the GitHub repository: https://github.com/healthdisparities/rye.
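The regression step named in the abstract can be illustrated with a minimal Python sketch. This is not Rye's actual implementation. It assumes each reference population is summarized by its mean position (centroid) in principal component space, models a query individual as a non-negative mixture of those centroids, and omits the Metropolis-Hastings optimization entirely. The function name `estimate_ancestry`, the centroid coordinates and the mixing fractions are all hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.optimize import nnls


def estimate_ancestry(ref_centroids, query_pcs):
    """Estimate ancestry fractions by non-negative least squares.

    ref_centroids: array (n_populations, n_pcs), one PC-space centroid
                   per reference population (an assumed summary).
    query_pcs:     array (n_individuals, n_pcs), PC coordinates of the
                   query individuals.
    Returns an array (n_individuals, n_populations) of fractions that
    sum to 1 for each individual.
    """
    design = ref_centroids.T  # columns are the population centroids
    fractions = np.zeros((query_pcs.shape[0], ref_centroids.shape[0]))
    for i, coords in enumerate(query_pcs):
        # Solve min ||design @ x - coords|| subject to x >= 0.
        coef, _residual = nnls(design, coords)
        total = coef.sum()
        # Rescale so the estimated fractions sum to 1.
        fractions[i] = coef / total if total > 0 else coef
    return fractions


# Hypothetical centroids for three reference groups in three PCs.
centroids = np.array([
    [1.0, 0.0, 0.2],    # African (made-up coordinates)
    [-0.5, 0.8, 0.1],   # European (made-up coordinates)
    [-0.5, -0.8, 0.3],  # Native American (made-up coordinates)
])

# Simulate two admixed individuals with known mixing fractions.
true_fractions = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
])
query = true_fractions @ centroids

estimated = estimate_ancestry(centroids, query)
print(np.round(estimated, 3))
```

In this noise-free toy case the centroids are linearly independent, so the regression recovers the simulated fractions. With real genotype data the fit is only approximate, which is presumably where the optimization step described in the abstract becomes important.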