Devos D, Valencia A
Protein Design Group, CNB-CSIC, Madrid, Spain.
Proteins. 2000 Oct 1;41(1):98-107.
The widening gap between known protein sequences and their functions has led to the practice of assigning a potential function to a protein on the basis of sequence similarity to proteins whose function has been experimentally investigated. We present here a critical view of the theoretical and practical bases for this approach. The results obtained by analyzing a significant number of true sequence similarities, derived directly from structural alignments, point to the complexity of function prediction. Different aspects of protein function, including (i) enzymatic function classification, (ii) functional annotations in the form of key words, (iii) classes of cellular function, and (iv) conservation of binding sites, can only be reliably transferred between similar sequences to a modest degree. The reason for this difficulty is a combination of unavoidable database inaccuracies and the plasticity of protein function. In addition, analysis of the relationship between sequence and functional descriptions defines an empirical limit for pairwise-based functional annotations: on average, the first three of the six digits used to describe protein folds in the FSSP database can be predicted at sequence identities as low as 7.5%; two of the four EC digits at 15% identity; half of the SWISS-PROT key words related to protein function at 20% identity; and half of the binding-site residues at 30% identity.
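The empirical limits quoted in the abstract can be read as a simple lookup from pairwise sequence identity to the annotation levels that might be transferable. The following is a minimal sketch under that reading, not the authors' method: the function name `transferable_annotations` and the table layout are hypothetical, and the thresholds are the average levels stated in the abstract, so an individual protein pair may behave quite differently.

```python
# Average sequence-identity levels (percent) quoted in the abstract.
# They are averages over many pairs, not guarantees for any single pair.
THRESHOLDS = [
    (7.5, "first three FSSP fold digits"),
    (15.0, "two of the four EC digits"),
    (20.0, "half of the SWISS-PROT functional key words"),
    (30.0, "half of the binding-site residues"),
]


def transferable_annotations(percent_identity):
    """Return the annotation levels whose quoted threshold is met.

    Hypothetical helper: it only compares the given pairwise identity
    against the average levels listed in THRESHOLDS.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must be between 0 and 100")
    return [label for minimum, label in THRESHOLDS if percent_identity >= minimum]


if __name__ == "__main__":
    for identity in (5.0, 18.0, 35.0):
        print(identity, transferable_annotations(identity))
```

Under this reading, a pair at 18% identity would meet the quoted levels for fold digits and partial EC numbers only, which illustrates why the abstract treats function transfer at low identity with caution.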