Division of Paleontology (Invertebrates), American Museum of Natural History, New York, NY, USA.
Department of Computer Science, Hunter College, City University of New York, New York, NY, USA.
Syst Biol. 2021 Oct 13;70(6):1163-1180. doi: 10.1093/sysbio/syab005.
Popular optimality criteria for phylogenetic trees focus on sequences of characters that are applicable to all the taxa. As studies grow in breadth, some characters may be applicable to only a portion of the taxa and inapplicable to the others. Past work has explored the limitations of treating inapplicable characters as missing data, noting that this strategy may favor trees where internal nodes are assigned impossible states, where the arrangement of taxa within subclades is unduly influenced by variation in distant parts of the tree, and/or where taxa that otherwise share most primary characters are grouped distantly. Approaches that avoid the first two problems have recently been proposed. Here, we propose an alternative approach that avoids all three problems. We focus on data matrices that use reductive coding of traits, that is, matrices that explicitly incorporate the innate hierarchy induced by inapplicability; as such, our approach extends to hierarchical characters in general. In the spirit of maximum parsimony, the proposed criterion seeks the phylogenetic tree with the minimal changes across any tree branch, but where changes are defined in terms of dissimilarity metrics that weigh the effects of inapplicable characters. The approach can accommodate binary, multistate, ordered, unordered, and polymorphic characters. We give a polynomial-time algorithm, inspired by Fitch's algorithm, to score trees under a family of dissimilarity metrics, and prove its correctness. We show that the resulting optimality criterion is computationally hard, by reduction from the NP-hard maximum parsimony optimality criterion. We demonstrate our approach using synthetic and empirical data sets and compare the results with other recently proposed methods for choosing optimal phylogenetic trees when the data include hierarchical characters. [Character optimization, dissimilarity metrics, hierarchical characters, inapplicable data, phylogenetic tree search.].
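For orientation, the classical Fitch algorithm that the abstract cites as inspiration can be sketched as below. This is a minimal illustration of standard Fitch scoring for a single unordered character on a rooted binary tree. It is not the authors' algorithm, and it does not implement their dissimilarity metrics or any treatment of inapplicable characters. The tree encoding (nested tuples of leaf names) and the names `fitch` and `fitch_score` are assumptions made for this example only.

```python
from typing import Dict, Set, Tuple, Union

# Hypothetical encoding: a tree is a leaf name or a (left, right) pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, states: Dict[str, Set[str]]) -> Tuple[Set[str], int]:
    """Bottom-up pass of classical Fitch scoring for one unordered character.

    Returns the candidate state set at the root of `tree` and the minimum
    number of state changes needed below it.
    """
    if isinstance(tree, str):
        # Leaf: its observed state set (more than one state if polymorphic).
        return set(states[tree]), 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change here.
        return shared, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def fitch_score(tree: Tree, states: Dict[str, Set[str]]) -> int:
    """Parsimony score (number of changes) of one character on `tree`."""
    return fitch(tree, states)[1]


if __name__ == "__main__":
    tree = (("A", "B"), ("C", "D"))
    character = {"A": {"0"}, "B": {"1"}, "C": {"1"}, "D": {"1"}}
    print(fitch_score(tree, character))
```

Each node is visited once, so the pass is linear in the number of taxa for a fixed number of states; the score of a full matrix under classical parsimony is the sum of `fitch_score` over its characters.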