Pandit Shashi B, Gosar Dilip, Abhiman S, Sujatha S, Dixit Sayali S, Mhatre Natasha S, Sowdhamini R, Srinivasan N
Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India.
Nucleic Acids Res. 2002 Jan 1;30(1):289-93. doi: 10.1093/nar/30.1.289.
Members of a superfamily of proteins could result from divergent evolution of homologues with insignificant similarity in the amino acid sequences. A superfamily relationship is commonly detected after the three-dimensional structures of the proteins are determined using X-ray analysis or NMR. The SUPFAM database described here relates pairs of homologous protein families, of either known or unknown structure, drawn from a multiple sequence alignment database. The present release (1.1), which is the first version of the SUPFAM database, has been derived by analysing Pfam, one of the commonly used databases of multiple sequence alignments of homologous proteins. The first step in establishing SUPFAM is to relate Pfam families to the families in PALI, an alignment database of homologous proteins of known structure that is derived largely from SCOP. The second step involves relating to one another those Pfam families which could not be associated reliably with a protein superfamily of known structure. The profile matching procedure IMPALA has been used in both steps. The first step identified 1280 Pfam families (out of 2697, i.e. 47%) which are related to a SCOP family, either by close homologous connection or by distant relationship, potentially forming new superfamily connections. Using the profiles of the 1417 Pfam families with apparently no structural information, an all-against-all comparison involving sequence-profile matching with IMPALA resulted in clustering of 67 homologous protein families of Pfam into 28 potential new superfamilies. Expansion of groups of related proteins of as yet unknown structure, as proposed in SUPFAM, should help in identifying 'priority proteins' for structure determination in structural genomics initiatives, expanding the coverage of structural information in protein sequence space.
For example, we could assign 858 distinct Pfam domains in 2203 of the gene products in the genome of Mycobacterium tuberculosis. Fifty-one of these Pfam families of unknown structure could be clustered into 17 potentially new superfamilies forming good targets for structural genomics. The SUPFAM database can be accessed at http://pauling.mbu.iisc.ernet.in/~supfam.
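The clustering of Pfam families into potential superfamilies after the all-against-all comparison can be illustrated with a minimal sketch. The abstract does not specify the grouping rule or the significance threshold, so the following assumes single-linkage grouping (connected components) over pairwise hits that pass a hypothetical E-value cutoff. The function name `cluster_families`, the `(query, target, evalue)` hit format, and the default cutoff are illustrative assumptions and are not part of SUPFAM or IMPALA.

```python
from collections import defaultdict


def cluster_families(hits, evalue_cutoff=1e-3):
    """Group families into connected components of significant pairwise hits.

    hits: iterable of (query_family, target_family, evalue) tuples, for
    example parsed from all-against-all profile search output.
    evalue_cutoff is a hypothetical threshold, not a value from the source.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, target, evalue in hits:
        # Ignore self-hits and hits weaker than the cutoff.
        if query == target or evalue > evalue_cutoff:
            continue
        root_q, root_t = find(query), find(target)
        if root_q != root_t:
            parent[root_t] = root_q

    groups = defaultdict(set)
    for family in list(parent):
        groups[find(family)].add(family)

    # Only groups of two or more families count as candidate superfamilies.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


if __name__ == "__main__":
    # Made-up family identifiers and E-values, for illustration only.
    example_hits = [
        ("FAM_A", "FAM_B", 1e-5),
        ("FAM_B", "FAM_C", 1e-4),
        ("FAM_D", "FAM_E", 1e-6),
        ("FAM_F", "FAM_G", 0.5),
    ]
    for cluster in cluster_families(example_hits):
        print(cluster)
```

Single linkage is the most permissive choice, since one significant hit is enough to merge two groups. A stricter rule, such as requiring reciprocal hits, would produce fewer and smaller clusters.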