Bahir Iris, Linial Michal
Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel.
Proteins. 2006 Jun 1;63(4):996-1004. doi: 10.1002/prot.20903.
The two ends of each protein are known as the amino (N-) and carboxyl (C-) termini. Short signatures in a protein's termini often carry vital cellular functions. No systematic research has been conducted to address the importance of short signatures (3 to 10 amino acids) in protein termini at the proteomic level. Specifically, it is unknown whether such signatures are evolutionarily conserved and, if so, whether this conservation confers shared biological functions. Current signature detection methods fail to detect such short signatures because their statistical scores are inadequate. The findings presented in this study strongly support the notion that the functional significance of protein sets may be captured by short signatures at their termini. A positional search method was applied to over one million proteins from the UniProt database. The result is a collection of about a thousand significant signature groups (SIGs) that include previously identified as well as many novel signatures in protein termini. These SIGs represent protein sets with minimal or no overall sequence similarity except for the similarity at their termini. The most significant SIGs are identified by their strong correspondence to functional annotations derived from external databases such as Gene Ontology. Each SIG is associated with the statistical significance of its functional association. These SIGs provide a valuable source for testing previously overlooked signatures in protein termini and allow for the investigation of the role played by such signatures throughout evolution. The SIGs archive and advanced search options are available at http://www.proteus.cs.huji.ac.il.
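As a rough illustration of the idea of grouping proteins by short terminal signatures, the following minimal Python sketch collects every N- and C-terminal substring of 3 to 10 residues and keeps those shared by at least two proteins. It is not the authors' positional search method and does not reproduce their statistical scoring or functional annotation step. The function names (`terminal_signatures`, `group_by_signature`), the `min_size` threshold, and the toy sequences are all assumptions made for illustration only.

```python
from collections import defaultdict


def terminal_signatures(seq, k_min=3, k_max=10):
    """Yield (terminus, signature) pairs for one protein sequence.

    The 3 to 10 residue range follows the abstract; everything else
    here is a hypothetical simplification.
    """
    for k in range(k_min, min(k_max, len(seq)) + 1):
        yield ("N", seq[:k])   # first k residues (amino terminus)
        yield ("C", seq[-k:])  # last k residues (carboxyl terminus)


def group_by_signature(proteins, min_size=2):
    """Map each terminal signature to the set of protein IDs carrying it.

    Only signatures shared by at least `min_size` proteins are kept.
    `min_size` is an illustrative cutoff, not a statistical test.
    """
    groups = defaultdict(set)
    for protein_id, seq in proteins.items():
        for key in terminal_signatures(seq):
            groups[key].add(protein_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= min_size}


# Toy sequences invented for this example (not taken from UniProt).
proteins = {
    "P1": "MSTAAGQWLLRVKDEL",
    "P2": "MKPLLVNNCTGHKDEL",
    "P3": "MARTWQEEPLSSGGAV",
}

groups = group_by_signature(proteins)
for (terminus, signature), ids in sorted(groups.items()):
    print(terminus, signature, sorted(ids))
```

In this toy input, P1 and P2 share no overall similarity but both end in KDEL, a well-known C-terminal endoplasmic reticulum retention signal, so they fall into the same group. A real analysis would additionally need a significance score for each group, which this sketch deliberately leaves out.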