Loeffler Hannes H, Wan Shunzhou, Klähn Marco, Bhati Agastya P, Coveney Peter V
Molecular AI, Discovery Sciences, R&D, AstraZeneca, Mölndal 431 83, Sweden.
Centre for Computational Science, Department of Chemistry, University College London, London WC1H 0AJ, U.K.
J Chem Theory Comput. 2024 Sep 3;20(18):8308-28. doi: 10.1021/acs.jctc.4c00576.
Active learning (AL) is a specific instance of sequential experimental design that uses machine learning to choose the next data point, or batch of molecular structures, to be evaluated. In this sense, it closely mimics the iterative design-make-test-analyze cycle of laboratory experiments used to find optimized compounds for a given design task. Here, we describe an AL protocol that combines generative molecular AI, using REINVENT, with physics-based absolute binding free energy molecular dynamics simulation, using ESMACS, to discover new ligands for two different target proteins, 3CL and TNKS2. We have deployed our generative active learning (GAL) protocol on Frontier, the world's only exascale machine. We show that the protocol finds higher-scoring molecules than the respective baselines: a surrogate ML docking model for 3CL, and compounds with experimentally determined binding affinities for TNKS2. The ligands found are also chemically diverse and occupy a different region of chemical space than the baseline. We vary the size of the batches put forward for free energy assessment in each GAL cycle to assess its impact on the efficiency of the GAL protocol, and we recommend optimal batch sizes for different scenarios. Overall, we demonstrate that combining physics-based and AI methods yields effective chemical space sampling at an unprecedented scale and is of immediate and direct relevance to modern, data-driven drug discovery.
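The generate, select, evaluate and update cycle described in the abstract can be illustrated with a toy loop. This is a minimal sketch under strong assumptions, not the authors' protocol: molecules are reduced to numbers in [0, 1], `oracle` is a hypothetical stand-in for an expensive free energy calculation, `generate_candidates` is a hypothetical stand-in for a generative model, and the surrogate is a simple nearest-neighbour average. None of these names correspond to the REINVENT or ESMACS interfaces.

```python
import random


def oracle(x):
    """Hypothetical expensive scorer (stand-in for a binding free energy run)."""
    return -(x - 0.7) ** 2  # higher is better; the toy optimum is at x = 0.7


def surrogate_predict(x, labeled, k=3):
    """Cheap surrogate: mean score of the k nearest already-evaluated points."""
    nearest = sorted(labeled, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)


def generate_candidates(labeled, n, rng, spread=0.1):
    """Toy generator: perturb the best 'molecules' found so far."""
    top = sorted(labeled, key=lambda pair: pair[1], reverse=True)[:5]
    candidates = []
    for _ in range(n):
        parent = rng.choice(top)[0]
        child = parent + rng.gauss(0.0, spread)
        candidates.append(min(1.0, max(0.0, child)))  # clamp to [0, 1]
    return candidates


def run_gal(n_cycles=5, batch_size=4, n_generated=50, seed=0):
    """Run the toy loop; return all evaluated points and the best score per cycle."""
    rng = random.Random(seed)
    # Initial random batch, evaluated with the expensive oracle.
    labeled = [(x, oracle(x)) for x in (rng.random() for _ in range(batch_size))]
    history = [max(score for _, score in labeled)]
    for _ in range(n_cycles):
        # 1. Generate many candidates cheaply.
        candidates = generate_candidates(labeled, n_generated, rng)
        # 2. Rank them with the surrogate and keep the top batch (greedy acquisition).
        candidates.sort(key=lambda x: surrogate_predict(x, labeled), reverse=True)
        batch = candidates[:batch_size]
        # 3. Evaluate only the batch with the oracle and grow the training set.
        labeled.extend((x, oracle(x)) for x in batch)
        history.append(max(score for _, score in labeled))
    return labeled, history
```

In this sketch, `batch_size` plays the role of the batch size varied in the study: each cycle spends `batch_size` oracle evaluations, so for a fixed evaluation budget a larger batch means fewer surrogate updates. The greedy top-of-ranking selection is only one possible acquisition rule and is an assumption here.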