Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824-1226, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Sep-Oct;9(5):1301-13. doi: 10.1109/TCBB.2012.36.
Predicting the binding affinities of large sets of protein-ligand complexes both accurately and efficiently is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Since a scoring function (SF) is used to score, rank, and identify drug leads, the fidelity with which it predicts the affinity of a ligand candidate for a protein's binding site has a significant bearing on the accuracy of virtual screening. Despite intense efforts in developing conventional SFs, which are force-field based, knowledge-based, or empirical, their limited ranking accuracy has been a major roadblock toward cost-effective drug discovery. Therefore, in this work, we explore a range of novel SFs employing different machine-learning (ML) approaches in conjunction with a variety of physicochemical and geometrical features characterizing protein-ligand complexes. We assess the ranking accuracies of these new ML-based SFs, as well as those of conventional SFs, on the 2007 and 2010 PDBbind benchmark data sets, using both diverse and protein-family-specific test sets. We also investigate how ranking accuracy is influenced by the size of the training data set and by the type and number of features used. Within clusters of protein-ligand complexes in which different ligands are bound to the same target protein, we find that the best ML-based SF ranks the ligands correctly, according to their experimentally determined binding affinities, 62.5 percent of the time and identifies the top binding ligand 78.1 percent of the time. For this SF, the Spearman correlation coefficient between the ranks of ligands ordered by predicted and by experimentally determined binding affinities is 0.771. Given the challenging nature of the ranking problem and the fact that SFs are used to screen millions of ligands, this represents a significant improvement over the best conventional SF we studied, for which the corresponding ranking performance values are 57.8 percent, 73.4 percent, and 0.677.
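The three ranking measures quoted above (fraction of clusters ranked fully correctly, fraction of clusters whose top binder is identified, and Spearman correlation) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the cluster layout, and the affinity values are all hypothetical, and computing the Spearman coefficient over the pooled complexes rather than per cluster is an assumption made here for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def ordering(values):
    """Indices of the ligands sorted from strongest to weakest affinity."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


def ranking_summary(clusters):
    """clusters: list of (predicted, experimental) affinity lists, one pair
    per target protein. Returns (fully ranked, top identified, Spearman)."""
    full = sum(ordering(p) == ordering(e) for p, e in clusters)
    top = sum(ordering(p)[0] == ordering(e)[0] for p, e in clusters)
    # Assumption: the correlation is taken over all complexes pooled together.
    pred = [v for p, _ in clusters for v in p]
    expt = [v for _, e in clusters for v in e]
    n = len(clusters)
    return full / n, top / n, spearman(pred, expt)


# Hypothetical data: four targets with three ligands each (made-up values).
clusters = [
    ([6.1, 7.4, 8.9], [5.8, 7.0, 9.2]),  # all three ligands ordered correctly
    ([7.9, 6.5, 8.8], [6.9, 7.3, 9.0]),  # top ligand right, lower two swapped
    ([8.2, 7.7, 5.0], [6.4, 8.1, 4.9]),  # top ligand wrong
    ([4.0, 5.5, 9.1], [4.2, 6.0, 8.7]),  # all three ligands ordered correctly
]

full, top, rho = ranking_summary(clusters)
print(f"fully ranked: {full:.3f}, top identified: {top:.3f}, Spearman: {rho:.3f}")
```

In this toy data, two of the four clusters are ranked fully correctly and three of the four have the top binder identified, which mirrors how the percentages in the abstract are defined over clusters rather than over individual ligands.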