Chen Ke, Kurgan Lukasz A, Ruan Jishou
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada.
BMC Struct Biol. 2007 Apr 16;7:25. doi: 10.1186/1472-6807-7-25.
Traditionally, the native structure of a protein has been believed to correspond to a global minimum of its free energy. However, as the number of known tertiary (3D) protein structures has grown, researchers have discovered that some proteins alter their structures in response to changes in their surroundings or with the help of other proteins or ligands. Such structural shifts play a crucial role in protein function. We therefore propose a machine learning method, referred to as FlexRP, for predicting the flexible/rigid regions of proteins; the method is based on a novel sequence representation and feature selection. Knowledge of the flexible/rigid regions may provide insights into the protein folding process and 3D structure prediction.
The flexible/rigid regions were defined using a dataset of protein sequences that have multiple experimental structures; this dataset was previously used to study the structural conservation of proteins. Sequences drawn from the dataset were represented with feature sets proposed in prior research (PSI-BLAST profiles, composition vectors, and binary sequence encoding) and with a newly proposed representation based on the frequencies of k-spaced amino acid pairs. Feature selection was then applied to these representations to reduce their dimensionality. The proposed method was compared with several machine learning methods for the prediction of flexible/rigid regions and with two recently proposed methods for the prediction of conformational changes and unstructured regions. The FlexRP method, which applies Logistic Regression to the collocation-based (k-spaced amino acid pair) representation with 95 features, obtained 79.5% accuracy. The two runner-up methods, which apply Support Vector Machine (SVM) and Naïve Bayes classifiers to the same sequence representation, obtained 79.2% and 78.4% accuracy, respectively. The remaining methods considered achieved accuracies below 70%. Finally, the Naïve Bayes method is shown to provide the highest sensitivity for the prediction of flexible regions, while FlexRP and SVM give the highest sensitivity for rigid regions.
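The k-spaced amino acid pair representation can be illustrated with a minimal sketch. The abstract does not give the exact feature definition, so the following are assumptions made for illustration only: a pair is taken to be two residues separated by exactly k intervening residues, each frequency is normalized by the number of valid pairs at that spacing, and the function names and the range of k values are hypothetical. The sketch does not reproduce the feature selection step that yields the 95 features used by FlexRP.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_spaced_pair_frequencies(sequence, k):
    """Frequencies of ordered residue pairs separated by exactly k residues.

    Assumption: frequencies are normalized by the number of valid pairs at
    spacing k; pairs containing a non-standard residue are skipped.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    seq = sequence.upper()
    counts = Counter()
    total = 0
    # Residue i is paired with residue i + k + 1 (k residues in between).
    for i in range(len(seq) - k - 1):
        first, second = seq[i], seq[i + k + 1]
        if first in AMINO_ACIDS and second in AMINO_ACIDS:
            counts[first + second] += 1
            total += 1
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


def k_spaced_feature_vector(sequence, max_k):
    """Concatenate the 400 pair frequencies for each spacing k = 0..max_k."""
    features = []
    for k in range(max_k + 1):
        freqs = k_spaced_pair_frequencies(sequence, k)
        # Sort keys so that the feature order is stable across sequences.
        features.extend(freqs[pair] for pair in sorted(freqs))
    return features
```

For a sequence such as `"ACAC"`, spacing k = 0 yields the adjacent pairs AC, CA, AC, while k = 1 yields AA and CC. Each spacing contributes 400 raw features, which is why a dimensionality reduction step of the kind described above is needed before classification.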
A new sequence representation that uses k-spaced amino acid pairs is shown to be the most effective for predicting the flexible/rigid regions of protein sequences. The proposed FlexRP method provides the highest prediction accuracy, about 80%. The experimental tests show that the FlexRP and SVM methods achieve high overall accuracy and the highest sensitivity for rigid regions, while the Naïve Bayes method gives the best predictions for flexible regions.
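The distinction between overall accuracy and per-class sensitivity can be made concrete with a short sketch. It assumes the standard definitions (accuracy is the fraction of residues labelled correctly; sensitivity for a class is the fraction of residues of that class that are predicted as that class). The label strings and the function name are hypothetical, and the numbers in the usage note are toy values, not results from the study.

```python
def accuracy_and_sensitivity(true_labels, predicted_labels,
                             classes=("flexible", "rigid")):
    """Return overall accuracy and a dict of per-class sensitivities.

    Sensitivity for a class is None when the class never occurs in
    true_labels, since the ratio is undefined in that case.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    sensitivity = {}
    for cls in classes:
        actual = [(t, p) for t, p in pairs if t == cls]
        if actual:
            sensitivity[cls] = sum(p == cls for _, p in actual) / len(actual)
        else:
            sensitivity[cls] = None
    return accuracy, sensitivity
```

With toy labels `["flexible", "flexible", "rigid", "rigid", "rigid"]` and predictions `["flexible", "rigid", "rigid", "rigid", "flexible"]`, the accuracy is 0.6, the sensitivity for flexible regions is 0.5, and the sensitivity for rigid regions is 2/3. Two classifiers with similar accuracy can therefore differ noticeably in which class they recover best.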