Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty.

Affiliations

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany.

Institute of Theoretical Informatics, Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany.

Publication information

Mol Biol Evol. 2023 Oct 4;40(10). doi: 10.1093/molbev/msad227.

Abstract

Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).
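
The three observations above can be illustrated with a minimal Python sketch. It is not the RAxML-NG implementation: the abstract gives no thresholds or search parameters, so the assumed difficulty range of 0 to 1, the cutoffs `easy_below` and `hard_above`, the `SearchPlan` fields, and the tree counts are all hypothetical placeholders chosen only to show the shape of the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchPlan:
    """Hypothetical description of how thorough a tree search should be."""
    label: str
    num_starting_trees: int  # placeholder value, not a RAxML-NG default
    stop_early: bool         # terminate the search at an earlier stage


def plan_search(difficulty: float,
                easy_below: float = 0.3,
                hard_above: float = 0.7) -> SearchPlan:
    """Map a predicted difficulty score to a search plan.

    Assumes the score lies in [0, 1]; both cutoffs are illustrative only.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty is assumed to lie in [0, 1]")
    if difficulty < easy_below:
        # Observation 1: easy datasets converge rapidly, so stop early.
        return SearchPlan("easy", num_starting_trees=1, stop_early=True)
    if difficulty > hard_above:
        # Observation 2: on difficult datasets, quickly infer one of the
        # many almost equally likely topologies.
        return SearchPlan("difficult", num_starting_trees=1, stop_early=True)
    # Observation 3: intermediate difficulty justifies a more extensive search.
    return SearchPlan("intermediate", num_starting_trees=10, stop_early=False)
```

Under these assumed cutoffs, `plan_search(0.5)` selects the more extensive search, while scores near either end of the range select a quick, single-tree search.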

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd43/10584362/bf45845bdb9d/msad227f1.jpg