Di Filippo Juan I, Rómoli Santiago, Cavasotto Claudio N
Computational Drug Design and Biomedical Informatics Laboratory, Instituto de Investigaciones en Medicina Traslacional (IIMT), CONICET-Universidad Austral, Pilar, Buenos Aires 1629, Argentina.
Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar 1629, Argentina.
ACS Omega. 2025 Apr 7;10(15):15598-15609. doi: 10.1021/acsomega.5c00829. eCollection 2025 Apr 22.
Structure-based virtual screening methods are now one of the key pillars of computational drug discovery. In recent years, high-throughput docking campaigns aided by machine learning (ML)-based protocols have emerged as a way to accelerate the identification of top-scoring molecules within ultralarge chemical libraries. However, studies validating these ML approaches have used only one or two targets and/or small molecule libraries. Herein, we extend the validation of ML protocols for the accelerated retrieval of virtual hits by using two standard, publicly available ∼100M-molecule libraries as well as a comprehensive benchmark set comprising molecular docking scores of a 10M-molecule library against 10 diverse protein targets with two docking programs, PLANTS and AutoDock Vina. In the 10M benchmark set, we show that, on average, more than 60% of the top 10k molecules and more than 70% of the top 1k molecules can be retrieved while reducing the number of docking evaluations by more than 97%, indicating robust performance of the ML protocol. With larger molecule libraries, we show that a proportional increase in the training set size enhances the performance of the ML model at retrieving virtual hits. In summary, our results support the use of ML methods to retrieve top-scoring molecules from chemical libraries containing hundreds of millions or even billions of molecules, where the role of ML models becomes even more critical because brute-force exploration of such libraries through molecular docking is infeasible in reasonable time frames.
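The two quantities reported in the abstract, the fraction of true top-k molecules retrieved and the reduction in docking evaluations, can be illustrated with a minimal sketch. This is not the authors' protocol: the function names are hypothetical, the docking scores are synthetic, and the ML model is replaced by the true score plus Gaussian noise. It also assumes that lower (more negative) scores are better, and it does not model the cost of docking a training set.

```python
import random


def select_for_docking(predicted_scores, budget):
    """Return indices of the `budget` molecules with the best (lowest) predicted scores."""
    ranked = sorted(range(len(predicted_scores)), key=lambda i: predicted_scores[i])
    return ranked[:budget]


def top_k_recall(true_scores, docked_ids, k):
    """Fraction of the k best true scorers (lowest scores) that are among the docked molecules."""
    ranked = sorted(range(len(true_scores)), key=lambda i: true_scores[i])
    true_top = set(ranked[:k])
    return len(true_top & set(docked_ids)) / k


def docking_reduction(n_docked, library_size):
    """Fraction of docking evaluations avoided relative to docking the whole library."""
    return 1.0 - n_docked / library_size


if __name__ == "__main__":
    rng = random.Random(0)
    library_size = 100_000  # illustrative size, far smaller than the libraries in the study

    # Synthetic "true" docking scores; a real campaign would obtain these by docking.
    true_scores = [rng.gauss(-7.0, 1.0) for _ in range(library_size)]

    # Stand-in for ML predictions: true score plus noise (an assumption, not a trained model).
    predicted = [s + rng.gauss(0.0, 0.5) for s in true_scores]

    # Dock only the 3% of molecules the surrogate ranks best.
    budget = int(0.03 * library_size)
    docked = select_for_docking(predicted, budget)

    print("reduction in docking evaluations:", docking_reduction(len(docked), library_size))
    for k in (100, 1000):
        print(f"recall of true top {k}:", top_k_recall(true_scores, docked, k))
```

The recall values printed by this sketch depend entirely on the assumed noise level and say nothing about the performance reported in the study; the sketch only shows how the two metrics relate to a fixed docking budget.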