Divergence and support among slightly suboptimal likelihood gene trees.

Affiliations

Department of Biology, Colorado State University, Fort Collins, CO 80523-1878, USA.

305 W. Magnolia Street PMB 134, Fort Collins, CO, 80521, USA.

Publication information

Cladistics. 2020 Jun;36(3):322-340. doi: 10.1111/cla.12404. Epub 2019 Nov 13.

DOI:10.1111/cla.12404

PMID:34618962

Abstract

Contemporary phylogenomic studies frequently incorporate two-step coalescent analyses wherein the first step is to infer individual-gene trees, generally using maximum-likelihood implemented in the popular programs PhyML or RAxML. Four concerns with this approach are that these programs only present a single fully resolved gene tree to the user despite potential for ambiguous support, insufficient phylogenetic signal to fully resolve each gene tree, inexact computer arithmetic affecting the reported likelihood of gene trees, and an exclusive focus on the most likely tree while ignoring trees that are only slightly suboptimal or within the error tolerance. Taken together, these four concerns are sufficient for RAxML and PhyML users to be suspicious of the resulting (perhaps over-resolved) gene-tree topologies and (perhaps unjustifiably high) bootstrap support for individual clades. In this study, we sought to determine how frequently these concerns apply in practice to contemporary phylogenomic studies that use RAxML for gene-tree inference. We did so by re-analyzing 100 genes from each of ten studies that, taken together, are representative of many empirical phylogenomic studies. Our seven findings are as follows. First, the few search replicates that are frequently applied in phylogenomic studies are generally insufficient to find the optimal gene-tree topology. Second, there is often more topological variation among slightly suboptimal gene trees relative to the best-reported tree than can be safely ignored. Third, the Shimodaira-Hasegawa-like approximate likelihood ratio test is highly effective at identifying dubiously supported clades and outperforms the alternative approaches of relying on bootstrap support or collapsing minimum-length branches. Fourth, the bootstrap can, but rarely does, indicate high support for clades that are not supported amongst slightly suboptimal trees. Fifth, increasing the accuracy by which RAxML optimizes model-parameter values generally has a nominal effect on selection of optimal trees. Sixth, tree searches using the GTRCAT model were generally less effective at finding optimal known trees than those using the GTRGAMMA model. Seventh, choice of gene-tree sampling strategy can affect inferred coalescent branch lengths, species-tree topology and branch support.
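The idea of checking clades against slightly suboptimal trees can be illustrated with a minimal sketch. This is not the authors' procedure. It assumes each search replicate has already been reduced to a log-likelihood and a set of clades, and that a hypothetical tolerance of 0.1 log-likelihood units defines "slightly suboptimal". The function names, the toy trees and the tolerance value are all illustrative assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Clade = FrozenSet[str]
# One search replicate: (log-likelihood, clades in the inferred tree).
Tree = Tuple[float, Set[Clade]]


def near_optimal(trees: List[Tree], tolerance: float) -> List[Tree]:
    """Keep trees whose log-likelihood is within `tolerance` of the best."""
    best_lnl = max(lnl for lnl, _ in trees)
    return [tree for tree in trees if best_lnl - tree[0] <= tolerance]


def clade_support(trees: List[Tree], tolerance: float) -> Dict[Clade, float]:
    """For each clade in the best tree, return the fraction of
    near-optimal trees that also contain it."""
    best_tree = max(trees, key=lambda tree: tree[0])
    kept = near_optimal(trees, tolerance)
    return {
        clade: sum(clade in clades for _, clades in kept) / len(kept)
        for clade in best_tree[1]
    }


# Hypothetical replicates for one gene (taxa A-D; values are made up).
TOY_TREES: List[Tree] = [
    (-1000.00, {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}),
    (-1000.05, {frozenset({"A", "B"}), frozenset({"A", "B", "D"})}),
    (-1000.08, {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}),
    (-1012.30, {frozenset({"A", "C"}), frozenset({"A", "C", "D"})}),
]

if __name__ == "__main__":
    for clade, share in clade_support(TOY_TREES, tolerance=0.1).items():
        print(sorted(clade), round(share, 2))
```

In this toy case the clade (A, B) appears in all three near-optimal trees, while (A, B, C) appears in only two of them, so a strict consensus of the near-optimal set would collapse it even though the single best tree resolves it.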
