Chernomor Olga, Minh Bui Quang, von Haeseler Arndt
1 Max F. Perutz Laboratories, Center for Integrative Bioinformatics Vienna, University of Vienna , Vienna, Austria .
2 Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna , Vienna, Austria .
J Comput Biol. 2015 Dec;22(12):1129-42. doi: 10.1089/cmb.2015.0146. Epub 2015 Oct 8.
In phylogenomic analysis the collection of trees with identical score (maximum likelihood or parsimony score) may hamper tree search algorithms. Such collections are coined phylogenetic terraces. For sparse supermatrices with a lot of missing data, the number of terraces and the number of trees on the terraces can be very large. If terraces are not taken into account, a lot of computation time might be unnecessarily spent to evaluate many trees that in fact have identical score. To save computation time during the tree search, it is worthwhile to quickly identify such cases. The score of a species tree is the sum of scores for all the so-called induced partition trees. Therefore, if the topological rearrangement applied to a species tree does not change the induced partition trees, the score of these partition trees is unchanged. Here, we provide the conditions under which the three most widely used topological rearrangements (nearest neighbor interchange, subtree pruning and regrafting, and tree bisection and reconnection) change the topologies of induced partition trees. During the tree search, these conditions allow us to quickly identify whether we can save computation time on the evaluation of newly encountered trees. We also introduce the concept of partial terraces and demonstrate that they occur more frequently than the original "full" terrace. Hence, partial terrace is the more important factor of timesaving compared to full terrace. Therefore, taking into account the above conditions and the partial terrace concept will help to speed up the tree search in phylogenomic inference.
在系统发育基因组学分析中,具有相同分数(最大似然分数或简约分数)的树的集合可能会妨碍树搜索算法。这样的集合被称为系统发育阶地。对于具有大量缺失数据的稀疏超级矩阵,阶地的数量以及阶地上树的数量可能非常大。如果不考虑阶地,可能会不必要地花费大量计算时间来评估许多实际上具有相同分数的树。为了在树搜索过程中节省计算时间,快速识别这种情况是值得的。物种树的分数是所有所谓诱导划分树的分数之和。因此,如果应用于物种树的拓扑重排不改变诱导划分树,这些划分树的分数就不会改变。在这里,我们提供了三种最广泛使用的拓扑重排(最近邻交换、子树剪枝与重接以及树二分与重连)改变诱导划分树拓扑的条件。在树搜索过程中,这些条件使我们能够快速确定是否可以在评估新遇到的树时节省计算时间。我们还引入了部分阶地的概念,并证明它们比原始的“完整”阶地出现得更频繁。因此,与完整阶地相比,部分阶地是更重要的节省时间的因素。因此,考虑上述条件和部分阶地概念将有助于加快系统发育基因组学推断中的树搜索。