Inferring whole-genome histories in large population datasets.

Affiliation

Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK.

Publication information

Nat Genet. 2019 Sep;51(9):1330-1338. doi: 10.1038/s41588-019-0483-y. Epub 2019 Sep 2.

DOI: 10.1038/s41588-019-0483-y

PMID: 31477934

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6726478/

Abstract

Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.
