Pan Xiao-Yong, Shen Hong-Bin
Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, 800 Dongchuan Road, Shanghai, China.
Protein Pept Lett. 2009;16(12):1447-54. doi: 10.2174/092986609789839250.
The B-factor, which measures the uncertainty in the position of an atom within a crystal structure, is highly correlated with protein internal motion. Although rapid progress in structural biology has made more accurate protein structures available than ever, the avalanche of new protein sequences in the post-genomic era keeps widening the gap between known protein sequences and known protein structures. Automated methods that predict the B-factor profile directly from the amino acid sequence are therefore urgently needed, so that the profiles can be used promptly in basic research. In this article, we propose a novel approach, called PredBF, to predict the real value of the B-factor. We first extract both global and local features from the protein sequences, together with their evolutionary information. Random forest feature selection is then applied to rank the features by importance, and the most important ones are fed to a two-stage support vector regression (SVR) model, in which the initial outputs of the first-layer SVR are passed to the second-layer SVR for final refinement. Our results show that a systematic analysis of feature importance gives deep insight into the different contributions of the features and is necessary for developing effective B-factor prediction tools. The two-layer SVR model designed in this study further enhances the robustness of B-factor profile prediction. PredBF is freely available as a web server for academic use at: http://www.csbio.sjtu.edu.cn/bioinf/PredBF
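The data flow described in the abstract (rank features, keep the most important ones, then refine first-stage predictions with a second stage) can be illustrated with a minimal sketch. This is not the PredBF implementation. All function names are hypothetical, the importance scores are assumed to come from an already trained random forest, the two regressors are passed in as plain callables standing in for trained SVR models, and the sliding window over first-stage outputs is an assumption, since the abstract only states that those outputs are fed to the second layer.

```python
from typing import Callable, List, Sequence

# A "regressor" here is any callable mapping a feature vector to one value.
# In practice each would be a trained SVR; that is outside this sketch.
Regressor = Callable[[Sequence[float]], float]


def select_top_features(rows: List[List[float]],
                        importances: Sequence[float],
                        k: int) -> List[List[float]]:
    """Keep the k feature columns with the highest importance scores.

    `importances` is assumed to hold one score per column, e.g. taken
    from a random forest fitted beforehand.
    """
    ranked = sorted(range(len(importances)),
                    key=lambda c: importances[c], reverse=True)
    keep = sorted(ranked[:k])  # preserve the original column order
    return [[row[c] for c in keep] for row in rows]


def window(values: Sequence[float], i: int, half_width: int,
           pad: float = 0.0) -> List[float]:
    """Values at positions i-half_width..i+half_width, padded at chain ends."""
    return [values[j] if 0 <= j < len(values) else pad
            for j in range(i - half_width, i + half_width + 1)]


def two_stage_predict(rows: List[List[float]],
                      stage1: Regressor,
                      stage2: Regressor,
                      half_width: int = 1) -> List[float]:
    """Predict one value per residue, then refine with a second stage.

    Stage 1 maps each residue's feature vector to an initial estimate.
    Stage 2 sees a window of neighbouring stage-1 estimates (an assumed
    design) and returns the refined value for the central residue.
    """
    first = [stage1(row) for row in rows]
    return [stage2(window(first, i, half_width)) for i in range(len(first))]
```

With trained models in place of the callables, `two_stage_predict(select_top_features(rows, importances, k), stage1, stage2)` would yield one refined value per residue. Passing the models as callables keeps the sketch independent of any particular SVR library.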