Nahal Yasmine, Menke Janosch, Martinelli Julien, Heinonen Markus, Kabeshov Mikhail, Janet Jon Paul, Nittinger Eva, Engkvist Ola, Kaski Samuel
Department of Computer Science, Aalto University, 02150, Espoo, Finland.
Molecular AI, Discovery Sciences, R&D, AstraZeneca, 431 83, Mölndal, Sweden.
J Cheminform. 2024 Dec 9;16(1):138. doi: 10.1186/s13321-024-00924-y.
Machine learning (ML) systems have enabled the modelling of quantitative structure-property relationships (QSPR) and structure-activity relationships (QSAR) using existing experimental data to predict target properties for new molecules. These property predictors hold significant potential in accelerating drug discovery by guiding generative artificial intelligence (AI) agents to explore desired chemical spaces. However, they often struggle to generalize due to the limited scope of the training data. When optimized by generative agents, this limitation can result in the generation of molecules with artificially high predicted probabilities of satisfying target properties, which subsequently fail experimental validation. To address this challenge, we propose an adaptive approach that integrates active learning (AL) and iterative feedback to refine property predictors, thereby improving the outcomes of their optimization by generative AI agents. Our method leverages the Expected Predictive Information Gain (EPIG) criterion to select additional molecules for evaluation by an oracle. This process aims to provide the greatest reduction in predictive uncertainty, enabling more accurate model evaluations of subsequently generated molecules. Recognizing the impracticality of immediate wet-lab or physics-based experiments due to time and logistical constraints, we propose leveraging human experts for their cost-effectiveness and domain knowledge to effectively augment property predictors, bridging gaps in the limited training data. Empirical evaluations through both simulated and real human-in-the-loop experiments demonstrate that our approach refines property predictors to better align with oracle assessments. Additionally, we observe improved accuracy of predicted properties as well as improved drug-likeness among the top-ranking generated molecules. 
SCIENTIFIC CONTRIBUTION: We present an adaptable framework that integrates AL and human expertise to refine property predictors for goal-oriented molecule generation. This approach is robust to noise in human feedback and ensures that navigating chemical space with human-refined predictors leverages human insights to identify molecules that not only satisfy predicted property profiles but also score highly on oracle models. Additionally, it prioritizes practical characteristics such as drug-likeness, synthetic accessibility, and a favorable balance between exploring diverse chemical space and exploiting similarity to existing training data.
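The EPIG-based selection step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binary (or multi-class) property classifier from which K posterior samples of the predictive probabilities are available, for example from an ensemble, and it uses the standard Monte Carlo estimate of EPIG as the mutual information between the prediction for a candidate molecule and the prediction for a target molecule, averaged over the target set. The function names `epig_scores` and `select_queries`, the array layout, and the toy probabilities are all hypothetical choices made for illustration.

```python
import numpy as np


def epig_scores(probs_pool, probs_targ, eps=1e-12):
    """Monte Carlo estimate of EPIG for each candidate (pool) molecule.

    probs_pool: array (K, N_pool, C), class probabilities for candidate
                molecules under K posterior samples of the predictor.
    probs_targ: array (K, N_targ, C), class probabilities for molecules
                drawn from the target distribution (e.g. generated ones).
    Returns an array of shape (N_pool,); higher means more informative.
    """
    probs_pool = np.asarray(probs_pool, dtype=float)
    probs_targ = np.asarray(probs_targ, dtype=float)
    n_samples = probs_pool.shape[0]

    # Joint predictive p(y, y* | x, x*), averaged over posterior samples.
    # Shape: (N_pool, N_targ, C, C).
    joint = np.einsum("knc,kmd->nmcd", probs_pool, probs_targ) / n_samples

    # Marginal predictives p(y | x) and p(y* | x*).
    marg_pool = probs_pool.mean(axis=0)
    marg_targ = probs_targ.mean(axis=0)
    indep = marg_pool[:, None, :, None] * marg_targ[None, :, None, :]

    # Mutual information = KL(joint || product of marginals), per (x, x*).
    mutual_info = np.sum(
        joint * (np.log(joint + eps) - np.log(indep + eps)), axis=(2, 3)
    )

    # EPIG: expectation over the target inputs x*.
    return mutual_info.mean(axis=1)


def select_queries(scores, n_queries):
    """Indices of the n_queries candidates with the highest scores."""
    return np.argsort(-np.asarray(scores))[:n_queries]


# Toy data (assumed): K=2 posterior samples, 2 candidates, 1 target, 2 classes.
# Candidate 0: the samples disagree in the same way as on the target molecule.
# Candidate 1: the samples agree, so labelling it says nothing about the target.
probs_pool = np.array([
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
])
probs_targ = np.array([
    [[1.0, 0.0]],
    [[0.0, 1.0]],
])

scores = epig_scores(probs_pool, probs_targ)
chosen = select_queries(scores, n_queries=1)
```

In a human-in-the-loop setting, the molecules at the indices in `chosen` would be the ones shown to the expert (or other oracle) for labelling before the predictor is retrained. Unlike a score based only on the candidate's own uncertainty, this one favours candidates whose labels are expected to reduce uncertainty on the target molecules.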