Department of Computer Science, KU Leuven, Belgium.
Evol Bioinform Online. 2013 May 6;9:185-202. doi: 10.4137/EBO.S11609. Print 2013.
We propose a novel method for protein subfamily identification, that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree from a multiple alignment of the protein sequences, then uses a post-pruning procedure to extract clusters from the tree. Unlike existing methods, it constructs the hierarchical tree top-down rather than bottom-up, and it associates particular mutations with each division into subclusters. The motivating hypothesis is that this approach may yield a better tree topology, and therefore more accurate subfamily identification, while additionally indicating functionally important sites and allowing easy classification of new proteins. A thorough experimental evaluation confirms the hypothesis: the novel method yields more accurate clusters and a better tree topology than the state-of-the-art method SCI-PHY, identifies known functional sites, and identifies mutations that alone allow new sequences to be classified with an accuracy approaching that of hidden Markov models.
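The top-down construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the split criterion (reduction in summed pairwise Hamming distance), the `min_size` stopping rule used in place of post-pruning, and all function names are assumptions chosen for illustration. Each internal node stores a test of the form "column `pos` holds residue `res`", so a new aligned sequence can be classified by following the tests alone.

```python
from itertools import combinations


def hamming(a, b):
    """Number of alignment columns in which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def spread(seqs):
    """Sum of pairwise Hamming distances within a group (assumed heterogeneity measure)."""
    return sum(hamming(a, b) for a, b in combinations(seqs, 2))


def best_split(seqs, min_size):
    """Return (gain, pos, res) for the best test 'column pos == res', or None."""
    base = spread(seqs)
    best = None
    for pos in range(len(seqs[0])):
        for res in sorted({s[pos] for s in seqs}):
            yes = [s for s in seqs if s[pos] == res]
            no = [s for s in seqs if s[pos] != res]
            if len(yes) < min_size or len(no) < min_size:
                continue  # skip tests that produce a too-small subcluster
            gain = base - spread(yes) - spread(no)
            if best is None or gain > best[0]:
                best = (gain, pos, res)
    return best


def build_tree(seqs, min_size=2):
    """Divide the sequences top-down; every internal node records its mutation test."""
    split = best_split(seqs, min_size)
    if split is None or split[0] <= 0:
        return {"leaf": sorted(seqs)}
    _, pos, res = split
    yes = [s for s in seqs if s[pos] == res]
    no = [s for s in seqs if s[pos] != res]
    return {
        "test": (pos, res),
        "yes": build_tree(yes, min_size),
        "no": build_tree(no, min_size),
    }


def classify(tree, seq):
    """Follow the stored tests to the leaf (cluster) that a new aligned sequence falls into."""
    while "leaf" not in tree:
        pos, res = tree["test"]
        tree = tree["yes"] if seq[pos] == res else tree["no"]
    return tree["leaf"]


# Toy alignment (made-up sequences): two groups that differ at columns 1 and 3.
alignment = ["MKVLA", "MKVLS", "MRVIA", "MRVIS"]
tree = build_tree(alignment)
print(tree["test"])
print(classify(tree, "MKVLT"))
```

On this toy alignment the root test is residue K at column 1 (0-based), and a new sequence is assigned to a cluster by checking that single column. A real pipeline would grow the full tree and then prune it to extract clusters, as the abstract describes.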