Kundu Indra, Paul Goutam, Banerjee Raja
Department of Bioinformatics, Maulana Abul Kalam Azad University of Technology (formerly known as West Bengal University of Technology) Kolkata India
Indian Statistical Institute Kolkata India
RSC Adv. 2018 Mar 28;8(22):12127-12137. doi: 10.1039/c8ra00003d. eCollection 2018 Mar 26.
There is a pressing need to transform the enormous amount of biological data available in various forms into meaningful knowledge. We applied machine learning (ML) models to existing protein-ligand binding affinity data in order to predict binding affinity in cases where it is unknown. ML methods are appreciably faster and cheaper than traditional experimental methods or computational scoring approaches. Such prediction has two prerequisites: training data with sufficient and unbiased features, and a prediction model that fits the data well. In our study, we applied the Random forest and Gaussian process regression algorithms from the Weka package to protein-ligand binding affinity data drawn from the PdbBind database, which holds protein and ligand binding information. The models are trained on selected fundamental properties of both the proteins and the ligands, which can be readily fetched from online databases or calculated when a structure is available. The models were assessed on the basis of the correlation coefficient and the root mean square error (RMSE). The Random forest model gave a correlation coefficient of 0.76 and an RMSE of 1.31. We also applied our features and prediction models to the dataset used by others and found that our model with our features outperformed the existing ones.
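The two assessment metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the study used the Weka package, whereas the sketch below is plain Python. It also assumes that the correlation coefficient is the Pearson correlation between measured and predicted affinities. The function names and the sample values are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation between measured and predicted values (assumed metric)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Covariance and variances are left unnormalized; the 1/n factors cancel.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


if __name__ == "__main__":
    # Hypothetical affinity values, not data from the study.
    measured = [4.2, 6.8, 5.5, 7.9, 3.1]
    predicted = [4.9, 6.1, 5.8, 7.0, 4.0]
    print("correlation:", round(pearson_r(measured, predicted), 3))
    print("RMSE:", round(rmse(measured, predicted), 3))
```

A higher correlation and a lower RMSE both indicate better agreement between predicted and measured affinities. Reporting the two together is useful because correlation alone does not reveal a systematic offset in the predictions.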