Lazareva-Ulitsky Betty, Diemer Karen, Thomas Paul D
Computational Biology Department, Applied Biosystems, Foster City, CA 94404, USA.
Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.
Phylogenetic analysis of protein sequences is widely used in protein function classification and delineation of subfamilies within larger families. In addition, the recent increase in the number of protein sequence entries annotated with controlled vocabulary terms describing function (e.g. the Gene Ontology) suggests that it may be possible to overlay these terms onto phylogenetic trees to automatically locate functional divergence events in protein family evolution. Phylogenetic analysis of large datasets requires fast algorithms, and even 'fast', approximate distance matrix-based phylogenetic algorithms are slow on large datasets because they involve calculating maximum likelihood estimates of pairwise evolutionary distances. There have been many attempts to classify protein sequences at the family and subfamily level without reconstructing phylogenetic trees, using instead hierarchical clustering with simpler distance measures, which also produces trees or dendrograms. How can these trees be compared in their ability to accurately classify protein sequences?
Given a 'reference classification' or 'group membership labels' for a set of related protein sequences, as well as a tree describing their relationships (e.g. a phylogenetic tree), we propose a method for dividing the tree into monophyletic or paraphyletic groups so as to optimize the correspondence between the reference groups and the tree-derived groups. We call the optimal correspondence achieved the 'accuracy of a tree-based classification (TBC)'; it measures the ability of a tree to separate proteins of similar function into monophyletic or paraphyletic groups. We apply this measure to compare classical NJ and UPGMA phylogenetic trees with trees obtained from hierarchical clustering using different protein similarity measures. Our preliminary analysis on a set of expert-curated protein families and alignments suggests that there is no uniformly superior algorithm, and that simple protein similarity measures combined with hierarchical clustering produce trees whose TBC accuracy is reasonable and often the highest. We used our measure to help design TIPS, a tree-building algorithm based on agglomerative clustering with a similarity measure derived from profile scoring. TIPS is comparable with phylogenetic algorithms in terms of classification accuracy and is much faster on large protein families. Owing to its time scalability and acceptable accuracy, TIPS is being used in the large-scale PANTHER protein classification project. The trees produced by different algorithms for different protein families can be viewed at http://panther.appliedbiosystems.com/pub/tree_quality/trees.jsp. For every tree and every level of classification granularity we provide the optimal TBC along with the reference classification.
The script that evaluates the accuracy of TBC is available at http://panther.appliedbiosystems.com/pub/tree_quality/index.jsp
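The idea of dividing a tree so that its groups best match a reference classification can be illustrated with a small sketch. This is not the authors' script or their exact accuracy measure. It rests on the following assumptions: the tree is rooted and given as nested tuples of leaf names, only monophyletic groups (whole clades) are considered, the 'level of granularity' is taken to be the number of groups k, and accuracy is taken to be the fraction of sequences carrying the majority reference label of their group. The function name best_clade_partition and the toy data are hypothetical.

```python
from collections import Counter

def best_clade_partition(tree, labels, k):
    """Cut a rooted tree into exactly k clades, maximizing majority-label agreement.

    tree   -- a leaf name (str) or a tuple of subtrees (assumed encoding)
    labels -- dict mapping each leaf name to its reference group label
    k      -- number of tree-derived groups (assumed notion of granularity)
    Returns (accuracy, groups); raises ValueError if no cut into k clades exists.
    """
    def solve(node):
        # Returns (leaves, label_counts, table), where table[j] holds the best
        # (score, groups) for splitting this subtree into exactly j clades.
        if isinstance(node, str):
            return [node], Counter([labels[node]]), {1: (1, [[node]])}
        leaves, counts = [], Counter()
        combined = {0: (0, [])}  # best results over the children seen so far
        for child in node:
            c_leaves, c_counts, c_table = solve(child)
            leaves += c_leaves
            counts += c_counts
            merged = {}
            for j1, (s1, g1) in combined.items():
                for j2, (s2, g2) in c_table.items():
                    j = j1 + j2
                    if j <= k and (j not in merged or s1 + s2 > merged[j][0]):
                        merged[j] = (s1 + s2, g1 + g2)
            combined = merged
        table = dict(combined)
        # Alternative: keep the whole subtree as one clade and score it by
        # the size of its most common reference label.
        table[1] = (max(counts.values()), [leaves])
        return leaves, counts, table

    leaves, _, table = solve(tree)
    if k not in table:
        raise ValueError(f"tree cannot be cut into exactly {k} clades")
    score, groups = table[k]
    return score / len(leaves), groups

# Toy example (hypothetical data): five sequences, two reference groups.
tree = ((("a", "b"), "c"), ("d", "e"))
labels = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Y"}
for k in (1, 2, 3):
    accuracy, groups = best_clade_partition(tree, labels, k)
    print(k, accuracy, groups)
```

In this toy tree the reference group Y is not a single clade, so a perfect score needs three groups. A measure that also admits paraphyletic groups, as the abstract describes, could recover Y as one group; the sketch does not attempt that.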