Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Ensenada, Baja California C.P. 22860, Mexico.
Centro de Nanociencias y Nanotecnología, Universidad Nacional Autónoma de México (UNAM), Ensenada, Baja California C.P. 22860, Mexico.
J Chem Inf Model. 2021 Nov 22;61(11):5362-5376. doi: 10.1021/acs.jcim.1c00511. Epub 2021 Oct 15.
One of the main challenges of structure-based virtual screening (SBVS) is accounting for receptor flexibility, because representing it explicitly in every docking run carries a high computational cost. A common alternative is ensemble docking, in which each ligand is docked against a set of receptor conformations. However, there is still no agreement on how to combine ensemble docking results into a final ligand ranking. A common choice is to aggregate the ensemble docking scores with consensus strategies, but these yield only a slight improvement over the single-structure approach. Here, we hypothesize that applying machine learning (ML) methods to ensemble docking results can improve the predictive power of SBVS. To test this hypothesis, four proteins were selected as case studies: CDK2, FXa, EGFR, and HSP90. Protein conformational ensembles were built from crystallographic structures, and the evaluated compound library comprised up to three benchmarking data sets (DUD, DEKOIS 2.0, and CSAR-2012) together with cocrystallized molecules. Ensemble docking results were processed through 30 repetitions of 4-fold cross-validation to train and validate two ML classifiers: logistic regression and gradient boosting trees. Our results indicate that the ML classifiers significantly outperform traditional consensus strategies and even the best-performing single-structure docking case. We provide statistical evidence supporting the effectiveness of ML in improving ensemble docking performance.
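The comparison described above can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption for illustration rather than the authors' pipeline: the docking scores are synthetic, ROC AUC is used as the metric (the abstract does not name one), the two consensus rules shown (mean score and best score over the ensemble) are only common examples, and the classifier is a small hand-written logistic regression. Gradient boosting trees are omitted for brevity. All function names are hypothetical.

```python
import math
import random

def roc_auc(labels, scores):
    """ROC AUC as the fraction of (active, decoy) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def consensus(X, how):
    """Collapse each ligand's per-conformation scores into one ranking value.
    Docking scores are 'lower is better', so the result is negated."""
    if how == "mean":
        return [-sum(row) / len(row) for row in X]
    return [-min(row) for row in X]  # "best": lowest score over the ensemble

def stratified_folds(y, k, rng):
    """Split indices into k folds with a similar active/decoy ratio in each."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def fit_logreg(X, y, lr=0.1, epochs=100):
    """Batch gradient descent on the logistic loss; returns a scoring function."""
    m = len(X[0])
    mu = [sum(row[j] for row in X) / len(X) for j in range(m)]  # training means only
    Xc = [[v - c for v, c in zip(row, mu)] for row in X]
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * m, 0.0
        for row, t in zip(Xc, y):
            z = b + sum(wj * v for wj, v in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z)))) - t
            gb += err
            for j, v in enumerate(row):
                gw[j] += err * v
        b -= lr * gb / len(Xc)
        w = [wj - lr * g / len(Xc) for wj, g in zip(w, gw)]
    return lambda row: b + sum(wj * (v - c) for wj, v, c in zip(w, row, mu))

def make_synthetic(n_actives=30, n_decoys=90, n_conf=5, seed=0):
    """Synthetic score matrix: one row per ligand, one column per conformation."""
    rng = random.Random(seed)
    X, y = [], []
    for label in [1] * n_actives + [0] * n_decoys:
        row = [rng.gauss(-7.0, 1.0) for _ in range(n_conf)]
        if label == 1:
            row[0] -= 1.5  # assumption: actives score better on conformation 0 only
        X.append(row)
        y.append(label)
    return X, y

def run_demo(n_repeats=30, k=4, seed=0):
    """Repeated stratified k-fold CV; returns the mean test AUC per method."""
    X, y = make_synthetic(seed=seed)
    rng = random.Random(seed)
    aucs = {"mean": [], "best": [], "logreg": []}
    for _ in range(n_repeats):
        for test in stratified_folds(y, k, rng):
            held_out = set(test)
            train = [i for i in range(len(y)) if i not in held_out]
            X_te, y_te = [X[i] for i in test], [y[i] for i in test]
            model = fit_logreg([X[i] for i in train], [y[i] for i in train])
            aucs["logreg"].append(roc_auc(y_te, [model(r) for r in X_te]))
            for how in ("mean", "best"):
                aucs[how].append(roc_auc(y_te, consensus(X_te, how)))
    return {name: sum(v) / len(v) for name, v in aucs.items()}

results = run_demo()
for name, auc in results.items():
    print(f"{name:>7}: mean ROC AUC over 30x4 folds = {auc:.3f}")
```

The consensus rules weight every conformation equally, whereas the classifier learns a separate weight per conformation from the training folds. That difference is one plausible reason a learned model could rank ligands better, though this toy data says nothing about the magnitude of any gain on real targets.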