Prediction and evolutionary information analysis of protein solvent accessibility using multiple linear regression.
Authors
Wang Jung-Ying, Lee Hahn-Ming, Ahmad Shandar
Affiliation
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Publication
Proteins. 2005 Nov 15;61(3):481-91. doi: 10.1002/prot.20620.
A multiple linear regression method was applied to predict real values of solvent accessibility from sequence and evolutionary information. The method yields regression and correlation coefficients that relate the occurrence of an amino acid residue at a target position, and at its sequence neighbor positions, to the solvent accessibility of the target residue. Our linear regression models based on sequence information and on evolutionary information predicted residue accessibility with mean absolute errors of 18.9% and 16.2%, respectively, which is better than or comparable to the best available methods. A correlation matrix covering several neighbor positions was developed and analyzed to examine the role of evolutionary information at these positions. As expected, the effective frequency of hydrophobic residues at target positions shows a strong negative correlation with solvent accessibility, whereas the reverse is true for charged and polar residues. The correlation of solvent accessibility with effective frequencies at neighboring positions falls abruptly with distance from the target residue. Longer protein chains were found to be predicted more accurately than shorter ones.
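The regression setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-hot window encoding, the window size (`half_window=2`), the helper names, and the data are all assumptions made for illustration. The accessibility values are synthetic, generated from a crude hydrophobic/charged/other rule plus noise, so the resulting error says nothing about real proteins. An evolutionary variant would presumably replace the one-hot entries with per-position profile frequencies, which the abstract does not specify in detail.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def window_features(sequence, half_window=2):
    """One-hot encode each residue together with its sequence neighbors.

    Hypothetical encoding: 20 columns per window position plus one
    intercept column. Window positions beyond the chain ends stay zero.
    """
    n = len(sequence)
    width = 2 * half_window + 1
    X = np.zeros((n, width * 20 + 1))
    X[:, -1] = 1.0  # intercept term
    for i in range(n):
        for k, offset in enumerate(range(-half_window, half_window + 1)):
            j = i + offset
            if 0 <= j < n and sequence[j] in AA_INDEX:
                X[i, k * 20 + AA_INDEX[sequence[j]]] = 1.0
    return X


def fit_linear(X, y):
    """Ordinary least squares; lstsq returns the minimum-norm solution
    when the one-hot columns are collinear with the intercept."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


# Synthetic data (assumption): buried hydrophobic, exposed charged residues.
rng = np.random.default_rng(0)
n_residues = 2000
seq = "".join(rng.choice(list(AMINO_ACIDS), size=n_residues))


def toy_accessibility(aa):
    if aa in "AVILMFWC":
        return 10.0  # hydrophobic: mostly buried
    if aa in "DEKR":
        return 60.0  # charged: mostly exposed
    return 35.0


y = np.array([toy_accessibility(aa) for aa in seq])
y = np.clip(y + rng.normal(0.0, 5.0, size=n_residues), 0.0, 100.0)

X = window_features(seq, half_window=2)
split = 1500
coef = fit_linear(X[:split], y[:split])

y_pred = X[split:] @ coef
mae_model = mean_absolute_error(y[split:], y_pred)
# Baseline: always predict the mean accessibility of the training set.
mae_baseline = mean_absolute_error(y[split:], np.full(n_residues - split, y[:split].mean()))

# Coefficients at the target (center) window position for Leu and Lys.
center = 2
coef_leu = coef[center * 20 + AA_INDEX["L"]]
coef_lys = coef[center * 20 + AA_INDEX["K"]]

print(f"MAE (regression): {mae_model:.2f}")
print(f"MAE (mean baseline): {mae_baseline:.2f}")
print(f"center coefficient Leu: {coef_leu:.2f}, Lys: {coef_lys:.2f}")
```

In this toy setting the fitted coefficient for a hydrophobic residue at the target position comes out lower than that for a charged residue, mirroring the sign of the correlations reported in the abstract. That outcome follows from how the synthetic data were generated and is not independent evidence.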