Retchin Michael, Wang Yuanqing, Takaba Kenichiro, Chodera John D
Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, Cornell University, New York, NY 10065.
Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065.
bioRxiv. 2024 Jun 2:2024.05.28.596296. doi: 10.1101/2024.05.28.596296.
Drug discovery is stochastic. The effectiveness of candidate compounds in satisfying design objectives is unknown ahead of time, and the tools used for prioritization (predictive models and assays) are inaccurate and noisy. In a typical discovery campaign, thousands of compounds may be synthesized and tested before design objectives are achieved, with many others ideated but deprioritized. These challenges are well-documented, but assessing potential remedies has been difficult. We introduce DrugGym, a framework for modeling the stochastic process of drug discovery. Emulating biochemical assays with realistic surrogate models, we simulate the progression from weak hits to sub-micromolar leads with viable ADME. We use this testbed to examine how different ideation, scoring, and decision-making strategies impact statistical measures of utility, such as the probability of program success within predefined budgets and the expected costs to achieve target candidate profile (TCP) goals. We also assess the influence of affinity model inaccuracy, chemical creativity, batch size, and multi-step reasoning. Our findings suggest that reducing affinity model inaccuracy from 2 to 0.5 pIC50 units improves budget-constrained success rates tenfold. DrugGym represents a realistic testbed for machine learning methods applied to the hit-to-lead phase. Source code is available at www.drug-gym.org.
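The link between affinity model inaccuracy and budget-constrained success can be illustrated with a toy Monte Carlo sketch. This is not the DrugGym API or the authors' simulation; every name, distribution, and parameter below (`simulate_campaign`, `success_rate`, the Gaussian potency pool, the noise-free assay, the budget and target values) is an assumption chosen only for illustration. Each round ideates a fresh pool of candidates, ranks them with a model whose error has standard deviation `model_sigma` (in pIC50 units), and assays only the top-ranked batch.

```python
import random


def simulate_campaign(model_sigma, rng, rounds=5, batch_size=10,
                      pool_size=100, target_pic50=9.0):
    """Return True if any assayed compound meets the target within the budget.

    The budget is rounds * batch_size assayed compounds. All values are
    hypothetical and are not taken from the paper.
    """
    for _ in range(rounds):
        # Ideation: true potencies of a fresh candidate pool (assumed Gaussian).
        true_vals = [rng.gauss(6.0, 1.0) for _ in range(pool_size)]
        # Scoring: model prediction = true value + Gaussian model error.
        scored = [(t + rng.gauss(0.0, model_sigma), t) for t in true_vals]
        # Decision: make and test only the top-scoring batch.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        tested = [t for _, t in scored[:batch_size]]
        # Assay: assumed noise-free here, so the true value is observed.
        if max(tested) >= target_pic50:
            return True
    return False


def success_rate(model_sigma, trials=1000, seed=0):
    """Estimate the probability of success within budget by repeated simulation."""
    rng = random.Random(seed)
    wins = sum(simulate_campaign(model_sigma, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    for sigma in (2.0, 0.5):
        print(f"model error {sigma} pIC50 units: "
              f"estimated success rate {success_rate(sigma):.2f}")
```

In this sketch a more accurate model succeeds more often under the same budget because true high-potency candidates are more likely to be ranked into the tested batch. The size of the effect depends entirely on the assumed parameters and should not be compared with the tenfold improvement reported above.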