Chakrabarti Saikat, Lanczycki Christopher J
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Protein Sci. 2007 Jan;16(1):4-13. doi: 10.1110/ps.062506407.
The rapidly increasing volume of sequence and structure information available for proteins poses the daunting task of determining their functional importance. Computational methods can prove to be very useful in understanding and characterizing the biochemical and evolutionary information contained in this wealth of data, particularly at functionally important sites. Therefore, we perform a detailed survey of compositional and evolutionary constraints at the molecular and biological function level for a large set of known functionally important sites extracted from a wide range of protein families. We compare the degree of conservation across different functional categories and provide detailed statistical insight to decipher the varying evolutionary constraints at functionally important sites. The compositional and evolutionary information at functionally important sites has been compiled into a library of functional templates. We developed a module that predicts functionally important columns (FIC) of an alignment based on the detection of a significant "template match score" to a library template. Our template match score measures an alignment column's similarity to a library template and combines a term explicitly representing a column's residue composition with various evolutionary conservation scores (information content and position-specific scoring matrix-derived statistics). Our benchmarking studies show good sensitivity/specificity for the prediction of functional sites and high accuracy in attributing correct molecular function type to the predicted sites. This prediction method is based on information derived from homologous sequences and no structural information is required. Therefore, this method could be extremely useful for large-scale functional annotation.
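The idea of scoring an alignment column against a library template can be illustrated with a small sketch. The abstract does not give the actual formula for the template match score, so everything below is an assumption made for illustration: the function names (`column_frequencies`, `information_content`, `composition_similarity`, `toy_match_score`), the use of cosine similarity for the composition term, the uniform background in the information content, and the equal weighting of the two terms are all hypothetical. The position-specific scoring matrix-derived statistics mentioned in the abstract are not modeled here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_frequencies(column):
    """Relative frequency of each standard residue in a column (gaps ignored)."""
    residues = [r for r in column.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("column contains no standard residues")
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}


def information_content(freqs):
    """Information content in bits against a uniform background: log2(20) - H."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(len(AMINO_ACIDS)) - entropy


def composition_similarity(freqs, template_freqs):
    """Cosine similarity between two residue-composition vectors (0 to 1)."""
    dot = sum(freqs[aa] * template_freqs[aa] for aa in AMINO_ACIDS)
    norm_a = math.sqrt(sum(v * v for v in freqs.values()))
    norm_b = math.sqrt(sum(v * v for v in template_freqs.values()))
    return dot / (norm_a * norm_b)


def toy_match_score(column, template_column, weight=0.5):
    """Hypothetical score: weighted mix of a composition term and a
    conservation-agreement term. NOT the published template match score."""
    freqs = column_frequencies(column)
    template = column_frequencies(template_column)
    max_ic = math.log2(len(AMINO_ACIDS))
    # Agreement in normalized information content between column and template.
    ic_gap = abs(information_content(freqs) - information_content(template)) / max_ic
    conservation_agreement = 1.0 - ic_gap
    return (weight * composition_similarity(freqs, template)
            + (1.0 - weight) * conservation_agreement)


if __name__ == "__main__":
    # An invariant histidine column compared with two toy templates.
    print(toy_match_score("HHHH", "HHHH"))
    print(toy_match_score("HHHH", "DDDD"))
```

In this toy version, a column identical to its template scores highest, while a column that is equally conserved but built from a different residue keeps only the conservation term. That separation is the reason a composition term is useful alongside conservation measures: conservation alone cannot distinguish, for example, an invariant histidine from an invariant aspartate.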