Efficient pedigree recording for fast population genetics simulation.

Affiliations

Big Data Institute, University of Oxford, Oxford, United Kingdom.

Ecology and Evolutionary Biology, University of California, Irvine, Irvine, California, United States of America.

Publication information

PLoS Comput Biol. 2018 Nov 1;14(11):e1006581. doi: 10.1371/journal.pcbi.1006581. eCollection 2018 Nov.

DOI:10.1371/journal.pcbi.1006581

PMID:30383757

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6233923/

Abstract

In this paper we describe how to efficiently record the entire genetic history of a population in forwards-time, individual-based population genetics simulations with arbitrary breeding models, population structure and demography. This approach dramatically reduces the computational burden of tracking individual genomes by allowing us to simulate only those loci that may affect reproduction (those having non-neutral variants). The genetic history of the population is recorded as a succinct tree sequence as introduced in the software package msprime, on which neutral mutations can be quickly placed afterwards. Recording the results of each breeding event requires storage that grows linearly with time, but there is a great deal of redundancy in this information. We solve this storage problem by providing an algorithm to quickly 'simplify' a tree sequence by removing this irrelevant history for a given set of genomes. By periodically simplifying the history with respect to the extant population, we show that the total storage space required is modest and overall large efficiency gains can be made over classical forward-time simulations. We implement a general-purpose framework for recording and simplifying genealogical data, which can be used to make simulations of any population model more efficient. We modify two popular forwards-time simulation frameworks to use this new approach and observe efficiency gains in large, whole-genome simulations of one to two orders of magnitude. In addition to speed, our method for recording pedigrees has several advantages: (1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes. (2) A population of N individuals with M polymorphic sites can be stored in O(N log N + M) space, making it feasible to store a simulation's entire final generation as well as its history. (3) A simulation can easily be initialized with a more efficient coalescent simulation of deep history. 
The software for recording and processing tree sequences is named tskit.
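
As a rough illustration of the record-then-simplify idea described in the abstract, the toy sketch below records every birth in a forward-time simulation and then discards the history that is irrelevant to the extant genomes. It is not the tskit algorithm: it assumes a haploid Wright-Fisher population and a single non-recombining locus, so the history is one tree rather than a tree sequence, and the names `simulate_pedigree` and `simplify` are hypothetical helpers invented for this example.

```python
import random


def simulate_pedigree(n, generations, seed=1):
    """Record every birth in a haploid Wright-Fisher population.

    Toy assumption: one non-recombining locus, so each genome has one parent.
    Returns (parent, time, extant), where parent[child] is a node id or None
    and time counts generations before the present.
    """
    rng = random.Random(seed)
    parent, time = {}, {}
    next_id = 0
    current = []
    # Founders have no recorded parent.
    for _ in range(n):
        parent[next_id] = None
        time[next_id] = generations
        current.append(next_id)
        next_id += 1
    # Each generation, every offspring picks a parent uniformly at random.
    for g in range(generations - 1, -1, -1):
        offspring = []
        for _ in range(n):
            parent[next_id] = rng.choice(current)
            time[next_id] = g
            offspring.append(next_id)
            next_id += 1
        current = offspring
    return parent, time, current


def simplify(parent, samples):
    """Keep only the samples and the nodes where sample lineages merge."""
    reachable = set()
    n_children = {}
    # Walk up from each sample; stop when joining an already visited lineage.
    for s in samples:
        node = s
        while node is not None and node not in reachable:
            reachable.add(node)
            p = parent[node]
            if p is not None:
                n_children[p] = n_children.get(p, 0) + 1
            node = p
    # A node is informative if it is a sample or a coalescence point.
    keep = set(samples) | {u for u, c in n_children.items() if c >= 2}
    simplified = {}
    for u in keep:
        a = parent[u]
        # Skip over ancestors with a single relevant child.
        while a is not None and a not in keep:
            a = parent[a]
        simplified[u] = a
    return simplified


parent, time, extant = simulate_pedigree(n=20, generations=50, seed=42)
simplified = simplify(parent, extant)
print(len(parent), "recorded nodes reduced to", len(simplified))
```

The recorded history grows by `n` nodes per generation, while the simplified history can never exceed `2n - 1` nodes in this single-tree setting, which is the intuition behind simplifying periodically with respect to the extant population.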

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/83f6/6233923/f50dcbf28e43/pcbi.1006581.g001.jpg

Similar articles

Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes.

Mol Ecol Resour. 2019 Mar;19(2):552-566. doi: 10.1111/1755-0998.12968. Epub 2019 Feb 21.

Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes.

PLoS Comput Biol. 2016 May 4;12(5):e1004842. doi: 10.1371/journal.pcbi.1004842. eCollection 2016 May.

Population genetic simulation: Benchmarking frameworks for non-standard models of natural selection.

Mol Ecol Resour. 2024 Apr;24(3):e13930. doi: 10.1111/1755-0998.13930. Epub 2024 Jan 21.

GENOMEPOP: a program to simulate genomes in populations.

BMC Bioinformatics. 2008 Apr 30;9:223. doi: 10.1186/1471-2105-9-223.

Forward-time simulations of human populations with complex diseases.

PLoS Genet. 2007 Mar 23;3(3):e47. doi: 10.1371/journal.pgen.0030047. Epub 2007 Feb 15.

Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes.

Genetics. 2020 Jul;215(3):779-797. doi: 10.1534/genetics.120.303253. Epub 2020 May 1.

Gene genealogies within a fixed pedigree, and the robustness of Kingman's coalescent.

Genetics. 2012 Apr;190(4):1433-45. doi: 10.1534/genetics.111.135574. Epub 2012 Jan 10.

sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs.

BMC Bioinformatics. 2019 Jan 15;20(1):26. doi: 10.1186/s12859-019-2611-1.

Efficient ancestry and mutation simulation with msprime 1.0.

Genetics. 2022 Mar 3;220(3). doi: 10.1093/genetics/iyab229.

Cited by

Robust and accurate Bayesian inference of genome-wide genealogies for hundreds of genomes.

Nat Genet. 2025 Sep 8. doi: 10.1038/s41588-025-02317-9.

Fast Phenotype Simulation for Genotype Representation Graphs.

bioRxiv. 2025 Aug 20:2025.08.15.670378. doi: 10.1101/2025.08.15.670378.

SLiM 5: Eco-evolutionary simulations across multiple chromosomes and full genomes.

bioRxiv. 2025 Aug 11:2025.08.07.669155. doi: 10.1101/2025.08.07.669155.

Effects of using deep learning to predict the geographic origin of barley genebank accessions on genome-environment association studies.

Theor Appl Genet. 2025 Aug 12;138(9):211. doi: 10.1007/s00122-025-05003-w.

Phylo-rs: an extensible phylogenetic analysis library in rust.

BMC Bioinformatics. 2025 Jul 29;26(1):197. doi: 10.1186/s12859-025-06234-w.

A genealogy-based approach for revealing ancestry-specific structures in admixed populations.

Am J Hum Genet. 2025 Jul 17. doi: 10.1016/j.ajhg.2025.06.016.

Tsbrowse: an interactive browser for ancestral recombination graphs.

Bioinformatics. 2025 Aug 2;41(8). doi: 10.1093/bioinformatics/btaf393.

Sweeps in Space: Leveraging Geographic Data to Identify Beneficial Alleles in Anopheles gambiae.

Mol Biol Evol. 2025 Jun 4;42(6). doi: 10.1093/molbev/msaf141.

An ancient origin of the naked grains of maize.

Proc Natl Acad Sci U S A. 2025 Jun 24;122(25):e2503748122. doi: 10.1073/pnas.2503748122. Epub 2025 Jun 17.

Analysis-ready VCF at Biobank scale using Zarr.

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf049.
