Theoretical and Practical Considerations when using Retroelement Insertions to Estimate Species Trees in the Anomaly Zone.

Authors

Molloy Erin K, Gatesy John, Springer Mark S

Affiliations

Department of Computer Science, University of Maryland, College Park, College Park, 20742, USA.

Department of Mammalogy, American Museum of Natural History, New York, 10024, USA.

Publication information

Syst Biol. 2022 Apr 19;71(3):721-740. doi: 10.1093/sysbio/syab086.

Abstract

A potential shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting. Coalescent methods address this problem but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescent methods, retroelement insertions (RIs) have emerged as powerful phylogenomic markers for species tree estimation. Here, we show that two recently proposed quartet-based methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the unrooted species tree topology under the coalescent when RIs follow a neutral infinite-sites model of mutation and the expected number of new RIs per generation is constant across the species tree. The accuracy of these (and other) methods for inferring species trees from RIs has yet to be assessed on simulated data sets, where the true species tree topology is known. Therefore, we evaluated eight methods given RIs simulated from four model species trees, all of which have short branches and at least three of which are in the anomaly zone. In our simulation study, ASTRAL_BP and SDPquartets always recovered the correct species tree topology when given a sufficiently large number of RIs, as predicted. A distance-based method (ASTRID_BP) and Dollo parsimony also performed well in recovering the species tree topology. In contrast, unordered, polymorphism, and Camin-Sokal parsimony (as well as an approach based on MDC) typically fail to recover the correct species tree topology in anomaly zone situations with more than four ingroup taxa. Of the methods studied, only ASTRAL_BP automatically estimates internal branch lengths (in coalescent units) and support values (i.e., local posterior probabilities). We examined the accuracy of branch length estimation, finding that estimated lengths were accurate for short branches but upwardly biased otherwise. 
This led us to derive the maximum likelihood (branch length) estimate for when RIs are given as input instead of binary gene trees; this corrected formula produced accurate estimates of branch lengths in our simulation study provided that a sufficiently large number of RIs were given as input. Lastly, we evaluated the impact of data quantity on species tree estimation by repeating the above experiments with input sizes varying from 100 to 100,000 parsimony-informative RIs. We found that, when given just 1000 parsimony-informative RIs as input, ASTRAL_BP successfully reconstructed major clades (i.e., clades separated by branches $>0.3$ coalescent units) with high support and identified rapid radiations (i.e., shorter connected branches), although not their precise branching order. The local posterior probability was effective for controlling false positive branches in these scenarios. [Coalescence; incomplete lineage sorting; Laurasiatheria; Palaeognathae; parsimony; polymorphism parsimony; retroelement insertions; species trees; transposon.].
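
To make the quartet signal mentioned above concrete, the following is a minimal sketch, not the implementation of ASTRAL_BP or SDPquartets. It assumes each RI is recorded as the set of taxa that carry the insertion, with no missing data, and it treats an RI as informative for a quartet only when exactly two of the four taxa carry it. The function names `quartet_support` and `best_quartet` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from itertools import combinations


def quartet_support(taxa, insertions):
    """Tally, for every set of four taxa, how many RIs support each split.

    taxa: iterable of taxon names.
    insertions: iterable of sets, each holding the taxa that carry one RI
                (assumed complete presence/absence data, no missing calls).
    Returns a dict mapping each sorted 4-tuple of taxa to a Counter whose
    keys are splits, written as a frozenset of two frozensets of taxa.
    """
    counts = {}
    for quartet in combinations(sorted(taxa), 4):
        tally = Counter()
        for carriers in insertions:
            present = [t for t in quartet if t in carriers]
            if len(present) != 2:
                # Carried by 0, 1, 3, or 4 of the taxa: no 2|2 split,
                # so this RI says nothing about this quartet.
                continue
            absent = [t for t in quartet if t not in carriers]
            split = frozenset([frozenset(present), frozenset(absent)])
            tally[split] += 1
        counts[quartet] = tally
    return counts


def best_quartet(tally):
    """Return the split with the most supporting RIs, or None if empty."""
    if not tally:
        return None
    return max(tally.items(), key=lambda kv: kv[1])[0]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    taxa = ["A", "B", "C", "D"]
    insertions = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"A", "C"}, {"A"}]
    tally = quartet_support(taxa, insertions)[("A", "B", "C", "D")]
    for split, n in tally.items():
        print(sorted(sorted(side) for side in split), n)
```

In this toy input, three insertions group A with B (or equivalently C with D) and one groups A with C, so the AB|CD split has the most support. Real analyses would also need to handle missing data and ties, which this sketch ignores.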
