Clustering of proximal sequence space for the identification of protein families.

Author information

Abascal Federico, Valencia Alfonso

Affiliation

Protein Design Group, National Centre for Biotechnology, CNB-CSIC, Cantoblanco, Madrid E-28049, Spain.

Publication information

Bioinformatics. 2002 Jul;18(7):908-21. doi: 10.1093/bioinformatics/18.7.908.

DOI:10.1093/bioinformatics/18.7.908

PMID:12117788

Abstract

MOTIVATION

The study of sequence space, and the deciphering of the structure of protein families and subfamilies, has up to now been required for work in comparative genomics and for the prediction of protein function. With the emergence of structural proteomics projects, it is becoming increasingly important to be able to select protein targets for structural studies that will appropriately cover the space of protein sequences, functions and genomic distribution. These problems are the motivation for the development of methods for clustering protein sequences and building families of potentially orthologous sequences, such as those proposed here.

RESULTS

First we developed a clustering strategy (Ncut algorithm) capable of forming groups of related sequences by assessing their pairwise relationships. The results presented for the ras super-family of proteins are similar to those produced by other clustering methods, but without the need for clustering the full sequence space. The Ncut clusters are then used as the input to a process of reconstruction of groups with equilibrated genomic composition formed by closely-related sequences. The results of applying this technique to the data set used in the construction of the COG database are very similar to those derived by the human experts responsible for this database.
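The abstract does not spell out the Ncut procedure, so the following is only a minimal sketch of the general normalized-cut idea on a graph of pairwise sequence similarities, not the authors' implementation. The similarity matrix `SIM`, the function names `ncut_value` and `best_bipartition`, and the exhaustive search over bipartitions are all assumptions made for illustration; a real sequence set would need similarity scores from an alignment tool and a scalable optimisation instead of brute force.

```python
from itertools import combinations


def ncut_value(sim, group_a, group_b):
    """Normalized cut between two disjoint groups of node indices.

    Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V),
    where V is the union of A and B.
    """
    nodes = list(group_a) + list(group_b)
    cut = sum(sim[i][j] for i in group_a for j in group_b)
    assoc_a = sum(sim[i][j] for i in group_a for j in nodes)
    assoc_b = sum(sim[i][j] for i in group_b for j in nodes)
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")  # a group with no links cannot be normalized
    return cut / assoc_a + cut / assoc_b


def best_bipartition(sim):
    """Exhaustively search for the bipartition with the lowest Ncut value.

    Only feasible for toy inputs; the search is exponential in the node count.
    """
    nodes = range(len(sim))
    best = (float("inf"), None, None)
    for size in range(1, len(sim) // 2 + 1):
        for group_a in combinations(nodes, size):
            group_b = tuple(i for i in nodes if i not in group_a)
            score = ncut_value(sim, group_a, group_b)
            if score < best[0]:
                best = (score, group_a, group_b)
    return best


# Hypothetical symmetric similarity scores for five sequences:
# sequences 0-2 are strongly related to each other, as are sequences 3-4.
SIM = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 0.0],
]

score, group_a, group_b = best_bipartition(SIM)
```

In this toy case the lowest-scoring split separates the two tightly linked groups, because the normalization penalises cutting off a single sequence even when few links are severed.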

AVAILABILITY

The analyses of different systems, including the COG-equivalent set of 21 genomes, are available at http://www.pdg.cnb.uam.es/GenoClustering.html.