McMahon Michelle M, Sanderson Michael J
Section of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA.
Syst Biol. 2006 Oct;55(5):818-36. doi: 10.1080/10635150600999150.
A comprehensive phylogeny of papilionoid legumes was inferred from sequences of 2228 taxa in GenBank release 147. A semiautomated analysis pipeline was constructed to download, parse, assemble, align, combine, and build trees from a pool of 11,881 sequences. Initial steps included all-against-all BLAST similarity searches coupled with assembly, using a novel strategy for building length-homogeneous primary sequence clusters. This was followed by a combination of global and local alignment protocols to build larger secondary clusters of locally aligned sequences, thus taking into account the dramatic differences in length of the heterogeneous coding and noncoding sequence data present in GenBank. Next, clusters were checked for the presence of duplicate genes and other potentially misleading sequences and examined for combinability with other clusters on the basis of taxon overlap. Finally, two supermatrices were constructed: a "sparse" matrix based on the primary clusters alone (1794 taxa x 53,977 characters), and a somewhat more "dense" matrix based on the secondary clusters (2228 taxa x 33,168 characters). Both matrices were very sparse, with 95% of their cells containing gaps or question marks. These were subjected to extensive heuristic parsimony analyses using deterministic and stochastic heuristics, including bootstrap analyses. A "reduced consensus" bootstrap analysis was also performed to detect cryptic signal in a subtree of the data set corresponding to a "backbone" phylogeny proposed in previous studies. Overall, the dense supermatrix appeared to provide much more satisfying results, indicated by better resolution of the bootstrap tree, excellent agreement with the backbone papilionoid tree in the reduced bootstrap consensus analysis, few problematic large polytomies in the strict consensus, and less fragmentation of conventionally recognized genera. Nevertheless, at lower taxonomic levels several problems were identified and diagnosed. 
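The supermatrix step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `build_supermatrix` and `missing_fraction`, the dictionary-based alignment representation, and the use of `?` as padding are all assumptions made for illustration. The sketch concatenates per-cluster alignments, pads taxa absent from a cluster with question marks, and reports the fraction of cells holding gaps or question marks, which is the quantity the abstract gives as 95%.

```python
from typing import Dict, List

# Hypothetical representation: one aligned cluster maps taxon name -> aligned sequence.
Alignment = Dict[str, str]


def build_supermatrix(clusters: List[Alignment], missing: str = "?") -> Dict[str, str]:
    """Concatenate cluster alignments; taxa absent from a cluster are padded with `missing`."""
    taxa = sorted({taxon for cluster in clusters for taxon in cluster})
    rows: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for cluster in clusters:
        lengths = {len(seq) for seq in cluster.values()}
        if len(lengths) != 1:
            raise ValueError("each cluster must be non-empty and aligned to a single length")
        width = lengths.pop()
        for taxon in taxa:
            # Use the real sequence if the taxon was sampled for this cluster, else pad.
            rows[taxon].append(cluster.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


def missing_fraction(matrix: Dict[str, str], missing_chars: str = "?-") -> float:
    """Fraction of matrix cells that are gaps or question marks."""
    total = sum(len(seq) for seq in matrix.values())
    if total == 0:
        return 0.0
    absent = sum(ch in missing_chars for seq in matrix.values() for ch in seq)
    return absent / total


if __name__ == "__main__":
    # Toy data only; these are not sequences from the study.
    toy_clusters = [
        {"A": "ACGT", "B": "AC-T"},
        {"B": "GG", "C": "GA"},
    ]
    toy_matrix = build_supermatrix(toy_clusters)
    for taxon, seq in toy_matrix.items():
        print(taxon, seq)
    print(round(missing_fraction(toy_matrix), 3))
```

In this toy case, taxon A has no data for the second cluster and taxon C has none for the first, so most of the missing cells come from padding rather than from alignment gaps. The same effect at scale is what makes both matrices in the study so sparse.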
A large number of methodological issues in supermatrix construction at this scale are discussed, including detection of annotation errors in GenBank sequences; the shortage of effective algorithms and software for local multiple sequence alignment; the difficulty of overcoming effects of fragmentation of data into nearly disjoint blocks in sparse supermatrices; and the lack of informative tools to assess confidence limits in very large trees.
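The fragmentation problem mentioned above, and the earlier check of cluster combinability by taxon overlap, can be made concrete with a small sketch. The abstract does not state the actual combinability criterion, so the threshold `min_shared` and the function name `overlap_components` are assumptions. The idea is to treat clusters as nodes, link two clusters when they share at least `min_shared` taxa, and report the connected components. More than one component means the supermatrix contains fully disjoint blocks.

```python
from typing import Dict, List, Set


def overlap_components(cluster_taxa: List[Set[str]], min_shared: int = 1) -> List[List[int]]:
    """Group cluster indices into components linked by shared taxa (union-find)."""
    n = len(cluster_taxa)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if len(cluster_taxa[i] & cluster_taxa[j]) >= min_shared:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Toy taxon sets only; these are not clusters from the study.
    toy = [{"A", "B"}, {"B", "C"}, {"D", "E"}]
    print(overlap_components(toy))                # clusters 0 and 1 share taxon B
    print(overlap_components(toy, min_shared=2))  # a stricter threshold separates them
```

Raising `min_shared` is one way to see how close a matrix is to splitting into nearly disjoint blocks: components that hold together only through one or two shared taxa separate as the threshold increases.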