Li Shuyan, Xi Lili, Wang Chengqi, Li Jiazhong, Lei Beilei, Liu Huanxiang, Yao Xiaojun
Department of Chemistry, Lanzhou University, Lanzhou 730000, China.
J Comput Chem. 2009 Apr 30;30(6):900-9. doi: 10.1002/jcc.21078.
In this study, a novel method was developed to predict protein-ligand binding affinity from a comprehensive set of structurally diverse protein-ligand complexes (PLCs). The predictive models were built on 1300 PLCs with known binding affinity (493 complexes with K(d) and 807 complexes with K(i)) taken from the refined set of the PDBbind database (release 2007). Each complex was described using descriptors calculated from three blocks: protein sequence, ligand structure, and binding pocket. The PLC data were then split into representative training and test sets, with model validation taken into account. The molecular descriptors relevant to binding affinity were selected on the training set using the ReliefF method combined with least squares support vector machine (LS-SVM) modeling. Two final optimized LS-SVM models were developed from the selected descriptors, one to predict K(d) and one to predict K(i). The correlation coefficients (R) for the K(d) model were 0.890 on the training set and 0.833 on the test set. The corresponding values for the K(i) model were 0.922 and 0.742. The proposed method generalizes better than other recently published methods and can be used as an alternative fast filter in the virtual screening of large chemical databases.
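As a rough illustration of the LS-SVM regression step and the correlation coefficient (R) reported for the training and test sets, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the data are synthetic stand-ins for the descriptor matrix, the RBF kernel and the values of `gamma` and `sigma` are assumptions chosen for the example, the function names are hypothetical, and the ReliefF descriptor selection is omitted. It uses the standard LS-SVM formulation, in which training reduces to solving one linear system for the bias `b` and the dual weights `alpha`.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def lssvm_fit(X, y, gamma, sigma):
    """Solve the LS-SVM regression system
        [ 0   1^T         ] [ b     ]   [ 0 ]
        [ 1   K + I/gamma ] [ alpha ] = [ y ]
    and return (b, alpha)."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]


def lssvm_predict(X_train, b, alpha, X_new, sigma):
    """Predict with f(x) = sum_i alpha_i * k(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-in data (assumption): 5 "descriptors", smooth target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

# Simple ordered split for illustration; the paper uses a representative split.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

GAMMA, SIGMA = 10.0, 2.0  # illustrative hyperparameters, not tuned
b, alpha = lssvm_fit(X_train, y_train, GAMMA, SIGMA)
pred_train = lssvm_predict(X_train, b, alpha, X_train, SIGMA)
pred_test = lssvm_predict(X_train, b, alpha, X_test, SIGMA)

r_train = pearson_r(y_train, pred_train)
r_test = pearson_r(y_test, pred_test)
print(f"R (training) = {r_train:.3f}, R (test) = {r_test:.3f}")
```

In practice `gamma` and `sigma` would be tuned, for example by cross-validation on the training set, and the same fit would be run separately for the K(d) and K(i) subsets.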