Parsimony analysis of phylogenomic datasets (II): evaluation of PAUP*, MEGA and MPBoot.

Author information

Unidad Ejecutora Lillo, Consejo Nacional de Investigaciones Científicas y Técnicas - Fundación Miguel Lillo, Miguel Lillo 251, San Miguel de Tucumán, Tucumán, 4000, Argentina.

American Museum of Natural History, 200 Central Park West, New York, NY, 10024, USA.

Publication information

Cladistics. 2022 Feb;38(1):126-146. doi: 10.1111/cla.12476. Epub 2021 Jul 14.

DOI:10.1111/cla.12476

PMID:35049082

Abstract

This paper examines the implementation of parsimony methods in the programs PAUP*, MEGA and MPBoot, and compares them with TNT. PAUP* implements standard, well-tested algorithms, with flexible search strategies and options for handling trees; its main drawback is the lack of advanced search algorithms, which makes it difficult to find most parsimonious trees for large and complex datasets. In addition, branch-swapping can be much slower than in TNT for datasets with large numbers of taxa, although this is only occasionally a problem for phylogenomic datasets, which typically have small numbers of taxa. The parsimony implementation of MEGA has major drawbacks. MEGA often fails to find parsimonious trees because it does not perform all possible subtree pruning-regrafting (SPR) or tree bisection-reconnection (TBR) branch-swapping rearrangements. Furthermore, it fails to properly handle ambiguity or multiple equally parsimonious trees, and it uses the same addition sequence for all bootstrap replicates. The latter yields group support values that depend on the order in which taxa are listed in the dataset. In addition, tree searches are very slow and do not facilitate the exploration of different starting points (as the random seed is fixed). MPBoot searches for optimal trees using the ratchet, but it is based on SPR instead of TBR (and by default evaluates only a subset of the SPR rearrangements). MPBoot approximates bootstrap frequencies by first finding a sample of trees and then selecting from those trees for every replicate, without performing a tree search. The approximation is too rough in many cases, producing serious under- or overestimations of the correct support values and, for most kinds of datasets, slower estimations than can be obtained with TNT. In addition, bootstrapping with PAUP*, MEGA or MPBoot can attribute strong support to groups that have no support at all under any meaningful concept of support, such as likelihood ratios or Bremer supports. In TNT, this problem is reduced by using the strict consensus tree to represent each replicate, or eliminated entirely by using different approximations of the Bremer support.
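
The effect of summarizing each bootstrap replicate by its strict consensus can be illustrated with a toy sketch. This is not code from TNT or from any of the programs evaluated; the tree representation (a tree as a set of clades), the function names and the data are all hypothetical. The sketch assumes that every replicate yields three equally parsimonious resolutions of taxa A, B and C, so no group is actually supported, and that a search with a fixed addition sequence always returns the same resolution first.

```python
from typing import FrozenSet, List, Set

Clade = FrozenSet[str]
Tree = Set[Clade]  # hypothetical representation: a tree as its set of clades


def strict_consensus(trees: List[Tree]) -> Tree:
    """Return the clades that are present in every tree of the list."""
    result = set(trees[0])
    for tree in trees[1:]:
        result &= tree
    return result


def support(replicates: List[List[Tree]], clade: Clade, use_strict: bool) -> float:
    """Fraction of replicates whose summary tree contains `clade`.

    With use_strict=False, each replicate is summarized by the first
    tree found; with use_strict=True, by the strict consensus of all
    equally parsimonious trees found for that replicate.
    """
    hits = 0
    for mpts in replicates:
        summary = strict_consensus(mpts) if use_strict else mpts[0]
        if clade in summary:
            hits += 1
    return hits / len(replicates)


# Hypothetical toy data: in every replicate the three resolutions of
# (A, B, C) are equally parsimonious, always found in the same order.
AB = frozenset({"A", "B"})
AC = frozenset({"A", "C"})
BC = frozenset({"B", "C"})
replicates = [[{AB}, {AC}, {BC}] for _ in range(100)]

first_tree_support = support(replicates, AB, use_strict=False)
strict_support = support(replicates, AB, use_strict=True)
print(first_tree_support, strict_support)
```

Under these assumptions, keeping only the first tree reports the group A+B in every replicate, whereas the strict consensus reports it in none, since the three resolutions share no clade.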
