Revealing remote protein homology with sequence similarity and a modularity-based approach.

Author information

Mei Juan, Yang Xiaojian, Zhou Weican

Affiliation

Department of Electronic and Information Technology, Wuxi City College of Vocational Technology, 12 Xinou Road, Wuxi 214153, China.

Publication information

Theor Biol Forum. 2011;104(1):57-68.

PMID:22220355

Abstract

An important task in functional genomics is to cluster homologous proteins, which may share common functions. Transferring annotations from homologues of known function to proteins of unknown function is one of the most efficient ways to predict protein function. In this paper, we use a modularity-based method called CD to group homologous proteins together. The method employs a global heuristic search strategy to find the partitioning of the weighted adjacency graph with the largest modularity. The weighted adjacency graph is constructed by applying a sigmoidal transformation to the pairwise sequence similarities between all protein sequences in a given dataset. The method has been extensively tested on several subsets from the superfamily level of the SCOP (Structural Classification of Proteins) database, where some homologous proteins have very low sequence similarity. Compared with the widely used method MCL, we observe that the number of clusters obtained by CD is closer to the number of superfamilies in the dataset, the F-measure given by CD is 10% better than that of MCL on average, and CD is more tolerant to noise in the sequence similarity. The experimental results indicate that CD is well suited to clustering homologous proteins when sequence similarity is low.
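The two ingredients named in the abstract, a sigmoidal transformation of pairwise similarities and the modularity of a partition, can be illustrated with a minimal Python sketch. This is not the authors' CD implementation: the abstract does not give the sigmoid's parameters or the search heuristic, so `midpoint` and `steepness` are hypothetical parameters, the similarity scores are made-up toy values, and the modularity shown is the standard weighted form for undirected graphs, assumed here to be the quantity being maximized.

```python
import math


def sigmoid_weights(sim, midpoint, steepness):
    """Map raw pairwise similarity scores to edge weights in (0, 1).

    `midpoint` and `steepness` are illustrative assumptions, not values
    taken from the paper. Self-loops are left at weight 0.
    """
    n = len(sim)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                z = steepness * (sim[i][j] - midpoint)
                w[i][j] = 1.0 / (1.0 + math.exp(-z))
    return w


def weighted_modularity(w, labels):
    """Standard weighted modularity Q of a partition of an undirected graph.

    Q = (1 / 2m) * sum over same-cluster pairs (i, j) of
        (w_ij - k_i * k_j / 2m), where k_i is the weighted degree of node i
    and 2m is the sum of all weighted degrees.
    """
    n = len(w)
    strength = [sum(row) for row in w]
    total = sum(strength)  # this is 2m
    if total == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += w[i][j] - strength[i] * strength[j] / total
    return q / total


# Toy symmetric similarity scores for four proteins forming two groups.
sim = [
    [0.0, 5.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 5.0, 0.0],
]
w = sigmoid_weights(sim, midpoint=2.5, steepness=2.0)

q_good = weighted_modularity(w, [0, 0, 1, 1])   # groups match the blocks
q_mixed = weighted_modularity(w, [0, 1, 0, 1])  # groups cut across the blocks
q_one = weighted_modularity(w, [0, 0, 0, 0])    # everything in one cluster
```

In this toy case the partition that matches the two blocks scores higher than the mixed partition, and placing every protein in one cluster gives a modularity of zero. A search strategy such as the one the abstract describes would explore candidate partitions and keep the one with the largest value.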
