Adamczak Rafał, Porollo Aleksey, Meller Jarosław
Biomedical Informatics, Children's Hospital Research Foundation, Cincinnati, Ohio 45229, USA.
Proteins. 2005 May 15;59(3):467-75. doi: 10.1002/prot.20441.
Owing to the use of evolutionary information and advanced machine learning protocols, secondary structures of amino acid residues in proteins can be predicted from the primary sequence with more than 75% per-residue accuracy for the 3-state (i.e., helix, beta-strand, and coil) classification problem. In this work we investigate whether further progress may be achieved by incorporating the relative solvent accessibility (RSA) of an amino acid residue as a fingerprint of the overall topology of the protein. Toward that goal, we developed a novel method for secondary structure prediction that uses predicted RSA in addition to attributes derived from evolutionary profiles. Our general approach follows the 2-stage protocol of Rost and Sander, with a number of Elman-type recurrent neural networks (NNs) combined into a consensus predictor. The RSA is predicted using our recently developed regression-based method that provides real-valued RSA, with the overall correlation coefficients between the actual and predicted RSA of about 0.66 in rigorous tests on independent control sets. Using the predicted RSA, we were able to improve the performance of our secondary structure prediction by up to 1.4% and achieved the overall per-residue accuracy between 77.0% and 78.4% for the 3-state classification problem on different control sets comprising, together, 603 proteins without homology to proteins included in the training. The effects of including solvent accessibility depend on the quality of RSA prediction. In the limit of perfect prediction (i.e., when using the actual RSA values derived from known protein structures), the accuracy of secondary structure prediction increases by up to 4%. We also observed that projecting real-valued RSA into 2 discrete classes with the commonly used threshold of 25% RSA decreases the classification accuracy for secondary structure prediction. 
While the level of improvement of secondary structure prediction may be different for prediction protocols that implicitly account for RSA in other ways, we conclude that an increase in the 3-state classification accuracy may be achieved when combining RSA with a state-of-the-art protocol utilizing evolutionary profiles. The new method is available through a Web server at http://sable.cchmc.org.
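The quantities reported above can be illustrated with a minimal sketch: the per-residue accuracy for the 3-state problem, the correlation coefficient between actual and predicted RSA, and the projection of real-valued RSA into 2 classes at the 25% threshold. The function names, the single-letter state codes (H, E, C), the toy inputs, and the choice to treat a residue exactly at the threshold as buried are all assumptions made for illustration; this is not the authors' implementation, and Pearson correlation is assumed as the correlation measure.

```python
from math import sqrt


def q3_accuracy(actual, predicted):
    """Per-residue 3-state accuracy: fraction of positions where the
    predicted state (assumed codes: H helix, E strand, C coil) matches."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists,
    e.g. actual and predicted real-valued RSA."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def binarize_rsa(rsa_percent, threshold=25.0):
    """Project real-valued RSA (in percent) into 2 classes:
    0 = buried, 1 = exposed. Treating RSA == threshold as buried
    is an assumption of this sketch."""
    return [1 if r > threshold else 0 for r in rsa_percent]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    actual_ss = "HHHHCCEEEC"
    predicted_ss = "HHHCCCEEEE"
    print(q3_accuracy(actual_ss, predicted_ss))

    actual_rsa = [5.0, 12.5, 40.0, 63.0, 25.0]
    predicted_rsa = [9.0, 20.0, 35.0, 50.0, 30.0]
    print(pearson(actual_rsa, predicted_rsa))
    print(binarize_rsa(actual_rsa))
```

The binarization step discards how far a residue lies from the threshold, which is one plausible reading of why the 2-class projection performs worse than real-valued RSA as an input attribute.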