Wang Yu, Guo Yanzhi, Kuang Qifan, Pu Xuemei, Ji Yue, Zhang Zhihang, Li Menglong
College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China.
J Comput Aided Mol Des. 2015 Apr;29(4):349-60. doi: 10.1007/s10822-014-9827-y. Epub 2014 Dec 20.
The assessment of binding affinity between ligands and their target proteins plays an essential role in the drug discovery and design process. As an alternative to widely used scoring approaches, machine learning methods have been proposed for fast prediction of binding affinity, with promising results. Most of them, however, were developed as all-purpose models that ignore the specific functions of different protein families, even though proteins from different functional families typically differ in structure and physicochemical features. In this study, we propose a random forest method to predict protein-ligand binding affinity from a comprehensive feature set covering protein sequence, binding pocket, ligand structure and intermolecular interaction. Feature processing and compression were carried out separately for each protein family dataset; the results indicate that different features contribute to different models, so an individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families: HIV-1 protease, trypsin and carbonic anhydrase. For comparison, two generic models covering diverse protein families were also built. The evaluation results show that models built on family-specific datasets outperform those built on the generic datasets. On the test sets, the Pearson correlation coefficients (Rp) are 0.740, 0.874 and 0.735, and the Spearman correlation coefficients (Rs) are 0.697, 0.853 and 0.723, for HIV-1 protease, trypsin and carbonic anhydrase, respectively. Comparisons with other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way to predict the affinity of a particular protein family.
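The abstract reports model quality as Pearson (Rp) and Spearman (Rs) correlation coefficients between measured and predicted affinities. The following is a minimal sketch of how these two metrics can be computed using only the Python standard library. It is not the authors' code: the function names (`pearson`, `average_ranks`, `spearman`) are hypothetical, and the affinity values are made up purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (Rp) between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks of the values; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation coefficient (Rs): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up measured and predicted affinities (illustration only, not data from the paper).
measured = [4.2, 5.1, 6.3, 7.0, 8.4, 9.1]
predicted = [4.8, 5.0, 6.9, 6.5, 8.0, 9.5]

rp = pearson(measured, predicted)
rs = spearman(measured, predicted)
print(f"Rp = {rp:.3f}, Rs = {rs:.3f}")
```

Rp measures how well the predictions follow a linear relationship with the measured values, while Rs depends only on rank order, so it reflects how well a model ranks ligands by affinity even when the relationship is not linear.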