Rye: genetic ancestry inference at biobank scale.

Affiliations

National Institute on Minority Health and Health Disparities, National Institutes of Health, Bethesda, MD, USA.

IHRC-Georgia Tech Applied Bioinformatics Laboratory, Atlanta, GA, USA.

Publication information

Nucleic Acids Res. 2023 May 8;51(8):e44. doi: 10.1093/nar/gkad149.

PMID:36928108

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10164567/

Abstract

Biobank projects are generating genomic data for many thousands of individuals. Computational methods are needed to handle these massive datasets, including genetic ancestry (GA) inference tools. Current methods for GA inference do not scale to biobank-size genomic datasets. We present Rye, a new algorithm for GA inference at biobank scale. We compared the accuracy and runtime performance of Rye to the widely used RFMix, ADMIXTURE and iAdmix programs and applied it to a dataset of 488,221 genome-wide variant samples from the UK Biobank. Rye infers GA based on principal component analysis of genomic variant samples from ancestral reference populations and query individuals. The algorithm's accuracy is powered by Metropolis-Hastings optimization and its speed is provided by non-negative least squares regression. Rye produces highly accurate GA estimates for three-way admixed populations (African, European and Native American) compared to RFMix and ADMIXTURE ($R^2 = 0.998$ to $1.00$), and shows a 50× runtime improvement compared to ADMIXTURE on the UK Biobank dataset. Rye analysis of UK Biobank samples demonstrates how it can be used to infer GA at both continental and subcontinental levels. We discuss user considerations and options for the use of Rye; the program and its documentation are distributed on the GitHub repository: https://github.com/healthdisparities/rye.
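The abstract names non-negative least squares (NNLS) regression on principal components as the source of Rye's speed. The following is a minimal sketch of that general idea, not Rye's actual implementation: it assumes each reference population is summarized by a centroid in PC space and that a query individual's PC coordinates are regressed on those centroids under a non-negativity constraint, with the weights then normalized to sum to one. The centroid values, the function names and the small coordinate-descent solver (standing in for a library routine such as `scipy.optimize.nnls`) are all illustrative assumptions, and the Metropolis-Hastings optimization step is omitted entirely.

```python
def nnls_coordinate_descent(A, b, sweeps=500):
    """Minimise ||A w - b||^2 subject to w >= 0 by cyclic coordinate descent.

    A is a list of rows (one row per PC, one column per population).
    This tiny solver is an illustrative stand-in for a library NNLS routine.
    """
    n_rows, n_cols = len(A), len(A[0])
    w = [0.0] * n_cols
    residual = list(b)  # equals b - A w, and w starts at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(sweeps):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue
            # Exact minimiser along coordinate j, clipped at zero.
            step = sum(A[i][j] * residual[i] for i in range(n_rows)) / col_sq[j]
            new_wj = max(0.0, w[j] + step)
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= delta * A[i][j]
                w[j] = new_wj
    return w


def ancestry_fractions(centroids, query_pcs):
    """Regress a query's PC coordinates on reference centroids (hypothetical helper)."""
    populations = list(centroids)
    # Design matrix: one row per PC, one column per reference population.
    A = [[centroids[pop][k] for pop in populations] for k in range(len(query_pcs))]
    weights = nnls_coordinate_descent(A, query_pcs)
    total = sum(weights)
    if total == 0.0:
        return {pop: 0.0 for pop in populations}
    return {pop: weight / total for pop, weight in zip(populations, weights)}


# Toy centroids on three PCs (made-up numbers, not derived from real data).
REFERENCE_CENTROIDS = {
    "African": [0.9, 0.1, 0.0],
    "European": [-0.4, 0.7, 0.1],
    "Native American": [-0.3, -0.6, 0.3],
}

# A synthetic query built as 0.5 African + 0.3 European + 0.2 Native American.
query = [0.27, 0.14, 0.09]
fractions = ancestry_fractions(REFERENCE_CENTROIDS, query)
for population, fraction in fractions.items():
    print(f"{population}: {fraction:.3f}")
```

Because each query is solved independently as a small constrained regression, the cost of this step grows linearly with the number of query individuals, which is one plausible reading of why an NNLS-based approach scales to biobank-size cohorts.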
