Iqbal Sumaiya, Hoque Md Tamjidul
Department of Computer Science, University of New Orleans, New Orleans, LA, United States of America.
PLoS One. 2016 Sep 2;11(9):e0161452. doi: 10.1371/journal.pone.0161452. eCollection 2016.
Features computed from the primary amino acid sequence of a protein are crucial for inducing a machine learning model that can accurately predict three-dimensional protein structure. Existing protein structure prediction problems need features that capture the complexity of molecular-level interactions. With this in view, we propose a novel approach to estimate the position specific estimated energy (PSEE) of a residue using contact energy and predicted relative solvent accessibility (RSA). Furthermore, we demonstrate that PSEE can be reasonably estimated from sequence information alone. PSEE is useful in identifying the structured as well as the unstructured, or intrinsically disordered, regions of a protein by computing favorable and unfavorable energy, respectively, separated by an appropriate threshold. The most intriguing finding, verified empirically, is that the PSEE feature can effectively classify disordered versus ordered residues and can segregate residues of different secondary structure types by computing the constituent energies. PSEE values for each amino acid strongly correlate with the hydrophobicity value of the corresponding amino acid. Further, PSEE can be used to detect critical binding regions that undergo disorder-to-order transitions to perform crucial biological functions. As an application to disorder prediction, we rigorously tested a support vector machine model and found that a model informed by a set of features including PSEE consistently outperforms a model with the identical feature set with PSEE removed. In addition, the new disorder predictor, DisPredict2, shows competitive performance in predicting protein disorder when compared with six existing disordered protein predictors.
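The abstract does not give the PSEE formula, so the following is only a minimal sketch of the general idea it describes: combine a pairwise contact energy with predicted RSA to get a per-position energy, then split residues into favorable (structured) and unfavorable (disordered) by a threshold. Everything here is an assumption made for illustration: the function names, the tiny `TOY_CONTACT_ENERGY` table with placeholder values, the sequence window, the burial weighting `(1 - rsa)`, and the threshold of 0.0. None of it is taken from the paper or from DisPredict2.

```python
from typing import Dict, List, Sequence, Tuple

# Placeholder pair energies for illustration only (NOT values from the paper).
# Negative means a favorable contact, positive an unfavorable one.
TOY_CONTACT_ENERGY: Dict[Tuple[str, str], float] = {
    ("L", "L"): -1.0,
    ("L", "K"): 0.2,
    ("K", "K"): 0.5,
}


def pair_energy(a: str, b: str, table: Dict[Tuple[str, str], float]) -> float:
    """Look up a symmetric pair energy; unknown pairs contribute 0.0."""
    return table.get((a, b), table.get((b, a), 0.0))


def toy_position_energy(
    seq: str,
    rsa: Sequence[float],
    table: Dict[Tuple[str, str], float] = TOY_CONTACT_ENERGY,
    window: int = 2,
) -> List[float]:
    """Hypothetical per-position energy.

    For each residue, sum the pair energies with sequence neighbours inside
    the window, weighting each pair by the burial (1 - RSA) of both residues.
    This weighting is an assumption of the sketch, not the paper's definition.
    """
    if len(seq) != len(rsa):
        raise ValueError("seq and rsa must have the same length")
    n = len(seq)
    energies = []
    for i in range(n):
        total = 0.0
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j == i:
                continue
            burial = (1.0 - rsa[i]) * (1.0 - rsa[j])
            total += pair_energy(seq[i], seq[j], table) * burial
        energies.append(total)
    return energies


def label_by_threshold(energies: Sequence[float], threshold: float = 0.0) -> List[str]:
    """Favorable (below threshold) -> 'ordered'; otherwise 'disordered'."""
    return ["ordered" if e < threshold else "disordered" for e in energies]


# Toy input: a buried hydrophobic stretch followed by an exposed charged one.
seq = "LLLKKK"
rsa = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]
energies = toy_position_energy(seq, rsa)
labels = label_by_threshold(energies)
```

In this toy input the buried leucine end receives a negative (favorable) energy and the exposed lysine end a small positive one, mirroring the favorable/unfavorable split the abstract describes. A real feature would use an established contact energy matrix and RSA values from a predictor.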