Agrawal Piyush, Mishra Gaurav, Raghava Gajendra P S
Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
Bioinformatics Center, CSIR-Institute of Microbial Technology, Chandigarh, India.
Front Pharmacol. 2020 Jan 30;10:1690. doi: 10.3389/fphar.2019.01690. eCollection 2019.
S-adenosyl-L-methionine (SAM) is an essential cofactor in biological systems and plays a key role in many diseases. A method for predicting SAM binding sites in a protein is needed to design drugs against SAM-associated diseases. To the best of our knowledge, no existing method can predict the SAM binding site from a given protein sequence.
This manuscript describes SAMbinder, a method developed for predicting SAM-interacting residues in a protein from its primary sequence. All models were trained, tested, and evaluated on 145 SAM-binding protein chains, where no two chains have more than 40% sequence similarity. First, models were developed using different machine learning techniques on a balanced data set containing 2,188 SAM-interacting residues and an equal number of non-interacting residues. Our random forest-based model, developed using the binary profile feature, achieved a maximum Matthews Correlation Coefficient (MCC) of 0.42 with an area under the receiver operating characteristic curve (AUROC) of 0.79 on the validation data set. Performance improved significantly, from MCC 0.42 to 0.61, when evolutionary information in the form of a position-specific scoring matrix (PSSM) profile was used as a feature. We also developed models on a realistic data set containing 2,188 SAM-interacting and 40,029 non-interacting residues and obtained a maximum MCC of 0.61 with an AUROC of 0.89. To evaluate the performance of our models, we used internal as well as external cross-validation techniques.
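The binary profile feature and the MCC metric mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a binary profile is a one-hot encoding of a sliding window centred on each residue, and the window size, the padding symbol `X`, and the function names `binary_profile` and `mcc` are hypothetical choices made for illustration only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"  # assumed padding symbol for positions beyond the chain ends


def binary_profile(sequence, window=5):
    """Return one feature vector per residue: a one-hot encoding of the
    window centred on that residue. Padding and non-standard residues
    are encoded as all zeros. The window size is an illustrative default."""
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a central residue")
    half = window // 2
    padded = PAD * half + sequence.upper() + PAD * half
    features = []
    for i in range(len(sequence)):
        vector = []
        for residue in padded[i:i + window]:
            one_hot = [0] * len(AMINO_ACIDS)
            idx = AMINO_ACIDS.find(residue)
            if idx >= 0:
                one_hot[idx] = 1
            vector.extend(one_hot)
        features.append(vector)
    return features


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.
    Returns 0.0 when the denominator is zero (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    feats = binary_profile("MKV", window=3)
    print(len(feats), len(feats[0]))  # residues, features per residue
    print(mcc(tp=40, tn=45, fp=5, fn=10))
```

Each residue yields a vector of length 20 × window, which could then be passed to a classifier such as a random forest. Unlike accuracy, MCC uses all four confusion-matrix counts, which makes it a reasonable choice for an imbalanced set like the realistic data set described above.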