Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes.

Affiliations

Institute of Evolution and Ecology, Departments of Mathematics and Biology, University of Oregon, Eugene, Oregon 97405

Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92697.

Publication information

Genetics. 2020 Jul;215(3):779-797. doi: 10.1534/genetics.120.303253. Epub 2020 May 1.

DOI:10.1534/genetics.120.303253

PMID:32357960

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7337078/

Abstract

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation; and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics' relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
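The weight-and-summary-function idea in the abstract can be illustrated on a single toy tree. The sketch below is not the authors' implementation: the tree, node times, mutation placement, and all function names are hypothetical, and it ignores the incremental updates between adjacent trees that a tree sequence makes possible. It assumes each sample carries weight 1 and uses the pairwise-difference summary function f(x) = 2x(n - x) / (n(n - 1)), which yields nucleotide diversity in site mode and mean pairwise path length in branch mode.

```python
from math import isclose

# Hypothetical toy genealogy (an assumption for illustration only):
# child -> parent map and node times; nodes 0..3 are the samples.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0, 6: 3.0}
samples = [0, 1, 2, 3]
weights = {s: 1.0 for s in samples}  # one unit of weight per sample


def accumulate_weights(parent, sample_weight):
    """Sum sample weights over the subtree below every node."""
    total = {}
    for sample, w in sample_weight.items():
        node = sample
        while node is not None:
            total[node] = total.get(node, 0.0) + w
            node = parent.get(node)  # None once we step past the root
    return total


def diversity_summary(x, n):
    """Chance that two distinct samples fall on opposite sides of a branch."""
    return 2.0 * x * (n - x) / (n * (n - 1))


def branch_stat(parent, time, sample_weight, f):
    """Branch mode: summary function weighted by branch length."""
    n = sum(sample_weight.values())
    x = accumulate_weights(parent, sample_weight)
    return sum((time[p] - time[c]) * f(x[c], n) for c, p in parent.items())


def site_stat(parent, mutation_nodes, sample_weight, f):
    """Site mode: summary function summed over mutations.

    Each entry of mutation_nodes is the node above which one mutation sits.
    """
    n = sum(sample_weight.values())
    x = accumulate_weights(parent, sample_weight)
    return sum(f(x[node], n) for node in mutation_nodes)


# Hypothetical mutations: one above sample 0, two above node 4, one above node 5.
mutations = [0, 4, 4, 5]
branch_diversity = branch_stat(parent, time, weights, diversity_summary)
site_diversity = site_stat(parent, mutations, weights, diversity_summary)
print(branch_diversity, site_diversity)
```

Swapping in a different weight assignment (for example, indicator weights for two sample sets) or a different summary function changes which statistic is computed, while the accumulation step stays the same. If mutations fall on branches at a constant rate per unit length, the expected site value is that rate times the branch value, which is the duality the abstract describes.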
