Rana Prashant Singh, Sharma Harish, Bhattacharya Mahua, Shukla Anupam
Department of Information Communication and Technology, ABV-Indian Institute of Information Technology and Management, Gwalior MP-474015, India.
J Bioinform Comput Biol. 2015 Apr;13(2):1550005. doi: 10.1142/S0219720015500055. Epub 2014 Dec 19.
Physicochemical properties of proteins help determine the quality of a protein structure, and they have therefore been used extensively to distinguish native or native-like structures from other predicted structures. In this work, we explore nine machine learning methods with six physicochemical properties to predict the Root Mean Square Deviation (RMSD), Template Modeling score (TM-score), and Global Distance Test score (GDT_TS-score) of a modeled protein structure in the absence of its true native state. The six properties are total surface area, Euclidean distance (ED), total empirical energy, secondary structure penalty (SS), sequence length (SL), and pair number (PN). The data set contains 95,091 modeled structures of 4,896 native targets. A real-coded Self-adaptive Differential Evolution algorithm (SaDE) is used to determine feature importance, and K-fold cross-validation is used to measure the robustness of the best predictive method. Extensive experiments show that the Random Forest method outperforms the other machine learning methods. This approach makes the prediction faster and less expensive. On the testing data set, the predictions of RMSD, TM-score, and GDT_TS-score achieve Root Mean Square Errors (RMSE) of 1.20, 0.06, and 0.06, respectively; correlation scores of 0.96, 0.92, and 0.91; R² values of 0.92, 0.85, and 0.84; and accuracies of 78.82%, 86.56%, and 87.37% (each with ± 0.1 err). The data set used in the study is available as a supplement at http://bit.ly/RF-PCP-DataSets.
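The four evaluation measures named in the abstract (RMSE, correlation, R², and accuracy within an error tolerance) can be sketched as follows. This is a minimal illustration, not the authors' code. The function names and the toy score values are hypothetical. The abstract does not define "accuracy (with ± 0.1 err)", so the sketch assumes it means the percentage of predictions whose absolute error is at most 0.1, and it assumes the correlation score is Pearson's r.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error between true and predicted scores."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (assumed meaning of 'correlation score')."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def tolerance_accuracy(y_true, y_pred, tol=0.1):
    """Percentage of predictions with |error| <= tol (assumed definition)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return 100.0 * hits / len(y_true)


# Toy TM-score-like values, invented purely for illustration.
y_true = [0.50, 0.62, 0.71, 0.35, 0.90]
y_pred = [0.55, 0.60, 0.65, 0.50, 0.88]

print("RMSE:", round(rmse(y_true, y_pred), 4))
print("Pearson r:", round(pearson_r(y_true, y_pred), 4))
print("R^2:", round(r_squared(y_true, y_pred), 4))
print("Accuracy within 0.1:", tolerance_accuracy(y_true, y_pred), "%")
```

A fixed tolerance of 0.1 is natural for TM-score and GDT_TS-score, which lie between 0 and 1. Whether the same tolerance applies to RMSD, which is measured in ångströms, is not stated in the abstract.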