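Hide and vanish: data sets where the most parsimonious tree is known but hard to find, and their implications for tree search methods.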

Consejo Nacional de Investigaciones Científicas y Técnicas, Fundación Miguel Lillo, Miguel Lillo 251, 4000 S.M. de Tucumán, Argentina.

Mol Phylogenet Evol. 2014 Oct;79:118-31. doi: 10.1016/j.ympev.2014.06.008. Epub 2014 Jun 18.

Three different types of data sets, for which the uniquely most parsimonious tree can be known exactly but is hard to find with heuristic tree search methods, are studied. Tree searches are complicated more by the shape of the tree landscape (i.e. the distribution of homoplasy on different trees) than by the sheer abundance of homoplasy or character conflict. Data sets of Type 1 are those constructed by Radel et al. (2013). Data sets of Type 2 present a very rugged landscape, with narrow peaks and valleys, but relatively low amounts of homoplasy. For such a tree landscape, subjecting the trees to TBR and saving suboptimal trees produces much better results when the sequence of clipping for the tree branches is randomized instead of fixed. An unexpected finding for data sets of Types 1 and 2 is that starting a search from a random tree instead of a random addition sequence Wagner tree may increase the probability that the search finds the most parsimonious tree; a small artificial example where these probabilities can be calculated exactly is presented. Data sets of Type 3, the most difficult data sets studied here, comprise only congruent characters, and a single island with only one most parsimonious tree. Even if there is a single island, missing entries create a very flat landscape which is difficult to traverse with tree search algorithms because the number of equally parsimonious trees that need to be saved and swapped to effectively move around the plateaus is too large. Minor modifications of the parameters of tree drifting, ratchet, and sectorial searches allow travelling around these plateaus much more efficiently than saving and swapping large numbers of equally parsimonious trees with TBR. For these data sets, two new related criteria for selecting taxon addition sequences in Wagner trees (the "selected" and "informative" addition sequences) produce much better results than the standard random or closest addition sequences. 
These new methods for Wagner trees and for moving around plateaus can be useful when analyzing phylogenomic data sets formed by concatenation of genes with uneven taxon representation ("sparse" supermatrices), which are likely to present a tree landscape with extensive plateaus.
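The effect of missing entries on the tree landscape can be illustrated with a minimal sketch of Fitch parsimony scoring for unordered characters. This is not the implementation used in the study; the function name `fitch_length`, the nested-tuple tree encoding, and the four-taxon matrices are hypothetical and chosen only for illustration. A missing entry (`?`) is treated as compatible with every state, so it can make trees that differ in length under complete data tie in length, which is one way a plateau can arise.

```python
def fitch_length(tree, matrix, states=("0", "1")):
    """Fitch parsimony length of a binary tree (unordered characters).

    tree   : nested 2-tuples of taxon names, e.g. (("A", "B"), ("C", "D"))
    matrix : dict mapping taxon name -> string of states, "?" = missing
    states : all states a missing entry may stand for (assumed binary here)
    """
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        def visit(node):
            nonlocal total
            if isinstance(node, str):
                # Leaf: observed state, or every state if the entry is missing.
                s = matrix[node][i]
                return set(states) if s == "?" else {s}
            left = visit(node[0])
            right = visit(node[1])
            common = left & right
            if common:
                return common
            # Empty intersection: one extra step is required at this node.
            total += 1
            return left | right
        visit(tree)
    return total


# Hypothetical one-character matrices: complete, and with one missing entry.
complete = {"A": "0", "B": "0", "C": "1", "D": "1"}
sparse = {"A": "0", "B": "?", "C": "1", "D": "1"}

tree_ab = (("A", "B"), ("C", "D"))
tree_ac = (("A", "C"), ("B", "D"))

print(fitch_length(tree_ab, complete), fitch_length(tree_ac, complete))
print(fitch_length(tree_ab, sparse), fitch_length(tree_ac, sparse))
```

With the complete matrix the two trees differ in length, whereas with the missing entry for taxon B they tie, so a search that keeps only strictly better trees gets no signal for moving between them.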
