Auch Alexander F, Henz Stefan R, Holland Barbara R, Göker Markus
Center for Bioinformatics ZBIT, University of Tübingen, Sand 14, Tübingen, Germany.
BMC Bioinformatics. 2006 Jul 19;7:350. doi: 10.1186/1471-2105-7-350.
Phylogenetic methods that do not rely on multiple sequence alignments are important tools for inferring trees directly from completely sequenced genomes. Here, we extend the recently described Genome BLAST Distance Phylogeny (GBDP) strategy to compute phylogenetic trees from all completely sequenced plastid genomes currently available and from a selection of mitochondrial genomes representing the major eukaryotic lineages. BLASTN, TBLASTX, or combinations of both are used to locate high-scoring segment pairs (HSPs) between two sequences, from which pairwise similarities and distances are computed in different ways, resulting in a total of 96 GBDP variants. The suitability of these distance formulae for phylogeny reconstruction is directly estimated by computing a recently described measure of "treelikeness", the so-called delta value, from the respective distance matrices. Additionally, we compare the trees inferred from these matrices using UPGMA, NJ, BIONJ, FastME, or STC with the NCBI taxonomy tree of the taxa under study.
Our results indicate that, at this taxonomic level, plastid genomes are much more valuable for inferring phylogenies than are mitochondrial genomes, and that distances based on breakpoints are of little use. Distances based on the proportion of "matched" HSP length to average genome length were best for tree estimation. Additionally, we found that using TBLASTX instead of BLASTN and, particularly, combining TBLASTX and BLASTN leads to a small but significant increase in accuracy. Other factors do not significantly affect the phylogenetic outcome. The BIONJ algorithm results in phylogenies most in accordance with the current NCBI taxonomy, with NJ and FastME performing slightly but not significantly worse, and STC performing as well if applied to high-quality distance matrices. Delta values are found to be a reliable predictor of phylogenetic accuracy.
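As a rough illustration of a distance based on the proportion of matched HSP length to genome length, the sketch below assumes that HSPs are given as half-open `(start, end)` intervals on each genome, that overlapping HSPs are counted once, and that the distance is one minus the covered length on both genomes divided by the sum of the two genome lengths (which equals average coverage over average genome length). The exact formulae among the 96 variants are not given here, so this specific form and the names `covered_length` and `coverage_distance` are assumptions.

```python
def covered_length(intervals):
    """Length covered by half-open (start, end) intervals, overlaps counted once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def coverage_distance(hsps_on_x, hsps_on_y, len_x, len_y):
    """1 - (matched length on both genomes) / (sum of genome lengths)."""
    matched = covered_length(hsps_on_x) + covered_length(hsps_on_y)
    return 1.0 - matched / (len_x + len_y)


# Hypothetical HSP coordinates for two genomes of length 100 each.
example = coverage_distance([(0, 30), (20, 50)], [(10, 60)], 100, 100)
```

In this toy case each genome is half covered by HSPs, so the distance is 0.5; two genomes with no HSPs at all would receive the maximum distance of 1.0.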
Using the most treelike distance matrices, as judged by their delta values, distance methods are able to recover all major plant lineages, and are more in accordance with Apicomplexa organelles being derived from "green" plastids than from plastids of the "red" type. GBDP-like methods can be used to reliably infer phylogenies from different kinds of genomic data. A framework is established to further develop and improve such methods. Delta values are a topology-independent tool of general use for the development and assessment of distance methods for phylogenetic inference.