Sandia National Laboratories, PO Box 5800, Albuquerque, NM 87185, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2011 Jul-Aug;8(4):902-11. doi: 10.1109/TCBB.2011.28.
Many of the steps in phylogenetic reconstruction can be confounded by “rogue” taxa: taxa that cannot be placed with assurance anywhere within the tree and whose location varies with almost any choice of algorithm or parameters. Phylogenetic consensus methods, in particular, are known to suffer from this problem. In this paper, we provide a novel framework to define and identify rogue taxa. Within this framework, we formulate a bicriterion optimization problem, the relative information criterion, that models the net increase in useful information present in the consensus tree when certain taxa are removed from the input data. We also provide an effective greedy heuristic to identify a subset of rogue taxa and use this heuristic in a series of experiments, with both pathological examples from the literature and a collection of large biological data sets. Because the presence of rogue taxa in a set of bootstrap replicates can lead to deceptively poor support values, we propose a procedure to recompute support values in light of the rogue taxa identified by our algorithm. Applying this procedure to our biological data sets caused a large number of edges to move from “unsupported” to “supported” status, indicating that many existing phylogenies should be recomputed and reevaluated to reduce any inaccuracies introduced by rogue taxa. We also discuss the implementation issues encountered while integrating our algorithm into RAxML v7.2.7, particularly those related to scaling up the analyses. This integration enables practitioners to benefit from our algorithm in the analysis of very large data sets (up to 2,500 taxa and 10,000 trees, although we present the results of even larger analyses).
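To make the idea of greedily removing rogue taxa concrete, here is a minimal Python sketch. It is not the authors' algorithm or the RAxML implementation. It assumes rooted trees encoded as collections of clusters (frozensets of taxon names, root cluster omitted), and it replaces the relative information criterion with a simplified stand-in score: the number of clusters in the majority-rule consensus of the pruned trees. The function names `majority_clusters` and `greedy_rogues` are hypothetical.

```python
from collections import Counter


def majority_clusters(trees, kept):
    """Clusters found in more than half of the trees once every tree is
    restricted to the taxa in `kept` (a simplified majority-rule consensus)."""
    counts = Counter()
    for tree in trees:
        # Restrict each cluster to the kept taxa; the set removes duplicates
        # that appear when two clusters collapse onto each other.
        restricted = {cluster & kept for cluster in tree}
        # Keep only informative clusters: at least two taxa, not all of them.
        counts.update(c for c in restricted if 2 <= len(c) < len(kept))
    return {c for c, n in counts.items() if 2 * n > len(trees)}


def greedy_rogues(trees, taxa):
    """Repeatedly drop the single taxon whose removal most increases the
    number of majority clusters; stop when no removal strictly improves it.
    The score is an assumed stand-in, not the paper's criterion."""
    kept = frozenset(taxa)
    dropped = []
    best = len(majority_clusters(trees, kept))
    while len(kept) > 3:
        candidates = [
            (len(majority_clusters(trees, kept - {t})), t) for t in sorted(kept)
        ]
        top_score, top_taxon = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break
        kept -= {top_taxon}
        dropped.append(top_taxon)
        best = top_score
    return dropped, best


def clusters(*groups):
    """Helper for the toy example: build one tree from strings of taxon names."""
    return [frozenset(g) for g in groups]


# Toy input: the backbone ((A,B),(C,D)) with taxon R attached in a
# different place in each of four trees.
trees = [
    clusters("RA", "RAB", "CD"),
    clusters("AB", "RC", "RCD"),
    clusters("BR", "ABR", "CD"),
    clusters("AB", "DR", "CDR"),
]
rogues, score = greedy_rogues(trees, "ABCDR")
print(rogues, score)
```

In this toy input no cluster reaches a majority while R is present, because R disrupts a different cluster in each tree. Once R is pruned, all four trees agree on {A,B} and {C,D}. The clusters that survive pruning are also the ones whose support values would be recomputed on the reduced taxon set.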