Tobi Dror, Bahar Ivet
Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15261, USA.
BMC Bioinformatics. 2007 Jun 28;8:226. doi: 10.1186/1471-2105-8-226.
A wealth of unannotated and functionally unknown protein sequences has accumulated in recent years with rapid progress in sequence genomics, giving rise to ever-increasing demand for methods that efficiently assess functional sites. Sequence and structure conservation have traditionally been the major criteria adopted in various algorithms to identify functional sites. Here, we examine the distributions of the 20^3 (8,000) different types of 3-grams (triplets of sequentially contiguous amino acids) in the entire space of sequences accumulated to date in the UniProt database, with particular attention to the rare 3-grams distinguished by their high entropy-based information content.
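The 3-gram statistics described above can be illustrated with a minimal sketch. It assumes that "information content" means the self-information -log2(p) of a 3-gram's relative frequency; the abstract does not give the exact definition, so this is an illustrative stand-in. The function names and the two toy sequences are hypothetical and are not taken from UniProt.

```python
import math
from collections import Counter

def count_3grams(sequences):
    """Count every overlapping triplet of contiguous residues."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    return counts

def information_content(counts):
    """Assumed definition: -log2 of each 3-gram's relative frequency."""
    total = sum(counts.values())
    return {gram: -math.log2(c / total) for gram, c in counts.items()}

# Toy input (hypothetical); a real analysis would stream UniProt sequences.
toy_sequences = ["ACDACD", "ACDWWW"]
counts = count_3grams(toy_sequences)
info = information_content(counts)

# Rarer 3-grams receive higher information content.
for gram, bits in sorted(info.items(), key=lambda kv: -kv[1]):
    print(gram, counts[gram], round(bits, 3))
```

Under this assumed definition, a 3-gram seen once among eight observed triplets scores 3 bits, while the most frequent one scores lower.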
Comparison of the UniProt distributions with those observed at or near the active sites in a non-redundant dataset of 59 enzyme/ligand complexes shows that the active sites preferentially recruit 3-grams distinguished by their low frequency in UniProt. Three cases, Src kinase, hemoglobin, and tyrosyl-tRNA synthetase, are discussed in detail to illustrate the biological significance of the results.
The results suggest that recruitment of rare 3-grams may be an efficient mechanism for increasing specificity at functional sites. Rareness/scarcity emerges as a feature that may assist in identifying key sites for protein function, providing information complementary to that derived from sequence alignments. In addition, it provides us (for the first time) with a means of identifying potentially functional sites from sequence information alone, when sequence conservation properties are not available.
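As a rough illustration of identifying candidate sites from sequence alone, the sketch below ranks the 3-gram windows of a single query sequence by rareness against a background frequency table. It is not the authors' procedure: the scoring (-log2 of background frequency), the function name `rank_rare_windows`, the query string, and the frequencies in `toy_background` are all hypothetical values chosen for demonstration.

```python
import math

def rank_rare_windows(sequence, background_freq, top_n=3):
    """Return the top_n rarest 3-gram windows as (position, 3-gram, score)."""
    scored = []
    for i in range(len(sequence) - 2):
        gram = sequence[i:i + 3]
        freq = background_freq.get(gram)
        if freq is None or freq <= 0:
            continue  # skip 3-grams missing from the background table
        scored.append((-math.log2(freq), i, gram))
    # Highest score (rarest) first; ties broken by earlier position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(i, gram, score) for score, i, gram in scored[:top_n]]

# Hypothetical background frequencies, not real UniProt values.
toy_background = {
    "GAL": 0.02, "ALW": 0.004, "LWC": 0.0005, "WCH": 0.0001, "CHG": 0.001,
}
hits = rank_rare_windows("GALWCHG", toy_background, top_n=2)
for position, gram, score in hits:
    print(position, gram, round(score, 2))
```

Windows flagged this way are only candidates; the abstract presents rareness as complementary to alignment-based conservation, not as a replacement for it.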