Distortion of genealogical properties when the sample is very large.

Affiliation

Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720.

Publication information

Proc Natl Acad Sci U S A. 2014 Feb 11;111(6):2385-90. doi: 10.1073/pnas.1322709111. Epub 2014 Jan 27.

DOI: 10.1073/pnas.1322709111

PMID: 24469801

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3926037/

Abstract

Study sample sizes in human genetics are growing rapidly, and in due course it will become routine to analyze samples with hundreds of thousands, if not millions, of individuals. In addition to posing computational challenges, such large sample sizes call for carefully reexamining the theoretical foundation underlying commonly used analytical tools. Here, we study the accuracy of the coalescent, a central model for studying the ancestry of a sample of individuals. The coalescent arises as a limit of a large class of random mating models, and it is an accurate approximation to the original model provided that the population size is sufficiently larger than the sample size. We develop a method for performing exact computation in the discrete-time Wright-Fisher (DTWF) model and compare several key genealogical quantities of interest with the coalescent predictions. For recently inferred demographic scenarios, we find that there are a significant number of multiple- and simultaneous-merger events under the DTWF model, which are absent in the coalescent by construction. Furthermore, for large sample sizes, there are noticeable differences in the expected number of rare variants between the coalescent and the DTWF model. To balance the trade-off between accuracy and computational efficiency, we propose a hybrid algorithm that uses the DTWF model for the recent past and the coalescent for the more distant past. Our results demonstrate that the hybrid method with only a handful of generations of the DTWF model leads to a frequency spectrum that is quite close to the prediction of the full DTWF model.
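The multiple- and simultaneous-merger events mentioned in the abstract can be illustrated with a minimal sketch. It assumes a haploid Wright-Fisher population of constant size N in which each of n sampled lineages picks its parent uniformly at random. This is not the paper's exact-computation method, which handles full demographic scenarios; the function name `dtwf_one_generation` is hypothetical. The sketch computes the single-generation probabilities of no merger, of exactly one pairwise merger (the only event type the coalescent allows), and of anything else.

```python
import math


def dtwf_one_generation(n, N):
    """One-generation merger probabilities for n lineages in a haploid
    Wright-Fisher population of constant size N (illustrative assumption).

    Returns (p_none, p_single, p_multi):
      p_none   - all n lineages choose distinct parents
      p_single - exactly one pair shares a parent, all others distinct
      p_multi  - any other outcome (multiple or simultaneous mergers)
    """
    if not 1 <= n <= N:
        raise ValueError("require 1 <= n <= N")
    # log of prod_{i=1}^{n-2} (1 - i/N), accumulated in log space for stability
    log_prod = sum(math.log1p(-i / N) for i in range(1, n - 1))
    # Choose the merging pair, then n-1 distinct parents out of N
    p_single = math.comb(n, 2) / N * math.exp(log_prod)
    # All n parents distinct: prod_{i=1}^{n-1} (1 - i/N)
    p_none = math.exp(log_prod + math.log1p(-(n - 1) / N))
    # Clamp to guard against tiny negative values from rounding
    p_multi = max(0.0, 1.0 - p_none - p_single)
    return p_none, p_single, p_multi


if __name__ == "__main__":
    N = 1_000_000  # assumed population size, for illustration only
    for n in (10, 1_000, 100_000):
        p_none, p_single, p_multi = dtwf_one_generation(n, N)
        print(f"n={n}: none={p_none:.3g} single={p_single:.3g} multi={p_multi:.3g}")
```

Under these assumptions, `p_multi` is negligible when n is much smaller than N and grows as the sample size approaches the population size, which is the regime where the abstract reports departures from the coalescent.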


Similar articles

ARGON: fast, whole-genome simulation of the discrete time Wright-Fisher process.

Bioinformatics. 2016 Oct 1;32(19):3032-4. doi: 10.1093/bioinformatics/btw355. Epub 2016 Jun 16.

Scaling the discrete-time Wright-Fisher model to biobank-scale datasets.

Genetics. 2023 Nov 1;225(3). doi: 10.1093/genetics/iyad168.

Exact coalescent for the Wright-Fisher model.

Theor Popul Biol. 2006 Jun;69(4):385-94. doi: 10.1016/j.tpb.2005.11.005. Epub 2006 Jan 19.

Scaling the Discrete-time Wright Fisher model to biobank-scale datasets.

bioRxiv. 2023 May 22:2023.05.19.541517. doi: 10.1101/2023.05.19.541517.

An efficient algorithm for generating the internal branches of a Kingman coalescent.

Theor Popul Biol. 2018 Jul;122:57-66. doi: 10.1016/j.tpb.2017.05.002. Epub 2017 Jul 11.

Single and simultaneous binary mergers in Wright-Fisher genealogies.

Theor Popul Biol. 2018 May;121:60-71. doi: 10.1016/j.tpb.2018.04.001. Epub 2018 Apr 12.

The Wright-Fisher site frequency spectrum as a perturbation of the coalescent's.

Theor Popul Biol. 2018 Dec;124:81-92. doi: 10.1016/j.tpb.2018.09.005. Epub 2018 Oct 9.

Structured coalescent processes from a modified Moran model with large offspring numbers.

Theor Popul Biol. 2009 Sep;76(2):92-104. doi: 10.1016/j.tpb.2009.05.001. Epub 2009 May 9.

How to infer relative fitness from a sample of genomic sequences.

Genetics. 2014 Jul;197(3):913-23. doi: 10.1534/genetics.113.160986. Epub 2014 Apr 26.

Cited by

Effective Population Size Estimation in Large Marine Populations: Considering Current Challenges and Opportunities When Simulating Large Data Sets With High-Density Genomic Information.

Evol Appl. 2025 Jul 28;18(8):e70121. doi: 10.1111/eva.70121. eCollection 2025 Aug.

Fast simulation of identity-by-descent segments.

Bull Math Biol. 2025 May 23;87(7):84. doi: 10.1007/s11538-025-01464-8.

Likelihoods for a general class of ARGs under the SMC.

bioRxiv. 2025 Feb 27:2025.02.24.639977. doi: 10.1101/2025.02.24.639977.

Labelled histories with multifurcation and simultaneity.

Philos Trans R Soc Lond B Biol Sci. 2025 Feb 13;380(1919):20230307. doi: 10.1098/rstb.2023.0307. Epub 2025 Feb 20.

Fast simulation of identity-by-descent segments.

bioRxiv. 2025 Jan 7:2024.12.13.628449. doi: 10.1101/2024.12.13.628449.

The Promise of Inferring the Past Using the Ancestral Recombination Graph.

Genome Biol Evol. 2024 Feb 1;16(2). doi: 10.1093/gbe/evae005.

Scaling the discrete-time Wright-Fisher model to biobank-scale datasets.

Genetics. 2023 Nov 1;225(3). doi: 10.1093/genetics/iyad168.

Scaling the Discrete-time Wright Fisher model to biobank-scale datasets.

bioRxiv. 2023 May 22:2023.05.19.541517. doi: 10.1101/2023.05.19.541517.

Efficient ancestry and mutation simulation with msprime 1.0.

Genetics. 2022 Mar 3;220(3). doi: 10.1093/genetics/iyab229.

Mutation saturation for fitness effects at human CpG sites.

Elife. 2021 Nov 22;10:e71513. doi: 10.7554/eLife.71513.

