A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements.

Affiliations

School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, USA.

Department of Mathematics and IDSS, Massachusetts Institute of Technology, Cambridge, USA.

Publication information

J Math Biol. 2022 Apr 8;84(5):36. doi: 10.1007/s00285-022-01731-5.

DOI: 10.1007/s00285-022-01731-5

PMID: 35394192

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9258723/

Abstract

Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene (the gene trees) often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated to each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard but unsatisfactory assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error, or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
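As background for the term used in the abstract, the classical (deterministic) Farris transform turns an additive tree metric into an ultrametric, that is, a clock-like one, by measuring every pair of leaves relative to a fixed reference leaf. The sketch below illustrates only that classical construction; the abstract does not spell out the stochastic version, so nothing here should be read as the authors' method. The function names, the toy tree, and its branch lengths are all assumptions made for illustration.

```python
from itertools import combinations


def farris_transform(d, root):
    """Classical Farris transform of a tree metric relative to `root`.

    `d` maps frozenset({x, y}) to the path distance between leaves x and y.
    Returns u(x, y) = 2c - d(x, root) - d(y, root) + d(x, y) for every pair
    of non-root leaves, where c is the largest distance to the root.
    """
    leaves = sorted({x for pair in d for x in pair} - {root})
    to_root = {x: d[frozenset((x, root))] for x in leaves}
    c = max(to_root.values())
    return {
        frozenset((x, y)): 2 * c - to_root[x] - to_root[y] + d[frozenset((x, y))]
        for x, y in combinations(leaves, 2)
    }


def is_ultrametric(u, tol=1e-9):
    """Check the three-point condition: the two largest distances are equal."""
    leaves = sorted({x for pair in u for x in pair})
    for x, y, z in combinations(leaves, 3):
        trio = sorted(
            (u[frozenset((x, y))], u[frozenset((x, z))], u[frozenset((y, z))])
        )
        if abs(trio[1] - trio[2]) > tol:
            return False
    return True


# Hypothetical toy tree (not from the paper), with unequal pendant branches:
# r-p = 1, c-p = 4, p-q = 2, a-q = 1, b-q = 3, where p and q are internal nodes.
tree_d = {
    frozenset(("r", "a")): 4,
    frozenset(("r", "b")): 6,
    frozenset(("r", "c")): 5,
    frozenset(("a", "b")): 4,
    frozenset(("a", "c")): 7,
    frozenset(("b", "c")): 9,
}

u = farris_transform(tree_d, "r")
print({tuple(sorted(pair)): dist for pair, dist in u.items()})
print("ultrametric:", is_ultrametric(u))
```

In this toy case the original distances among a, b and c are not clock-like (4, 7, 9), while the transformed distances satisfy the three-point condition, with a and b as the closest pair, matching the rooted shape of the tree as seen from r.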
