Zhao Dengwei, Zhou Jingyuan, Tu Shikui, Xu Lei
IEEE/ACM Trans Comput Biol Bioinform. 2024 Nov-Dec;21(6):2459-2470. doi: 10.1109/TCBB.2024.3477592. Epub 2024 Dec 10.
Generating high-quality, drug-like molecules from scratch within the expansive chemical space is a significant challenge in drug discovery. In prior research, value-based reinforcement learning algorithms were employed to iteratively generate molecules with multiple desired properties. The immediate reward was defined as the evaluation of the intermediate-state molecule at each step, and the learning objective was to maximize the expected cumulative evaluation scores of all molecules along the generative path. However, this definition of the reward is misleading: in reality, the optimization target should be the evaluation score of only the final generated molecule. Furthermore, previous works introduced randomness into the decision-making process, which enabled the generation of diverse molecules but no longer pursued the maximum future reward. In this paper, the immediate reward is defined as the improvement achieved by modifying the molecule, so that only the evaluation score of the final generated molecule is maximized. Originating from A* search, path consistency (PC), i.e., that the values along one optimal path should be identical, is employed as the objective function for updating the value estimator, yielding a multi-objective de novo drug designer. By incorporating the value estimate into the decision-making of beam search, the DrugBA* algorithm is proposed to enable the large-scale generation of molecules that exhibit both high quality and diversity. Experimental results demonstrate a substantial improvement over the state-of-the-art algorithm QADD across multiple molecular properties of the generated molecules.
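To make the reshaped reward and the PC objective concrete, here is a minimal NumPy sketch; all function names and the toy numbers are illustrative assumptions, not the paper's implementation. It shows that per-step improvements telescope to the final score minus the initial score, and a simple PC-style penalty that drives the values along one path toward a common constant.

```python
import numpy as np

def immediate_rewards(step_scores):
    """Reshaped immediate reward: the improvement in evaluation score
    achieved by each modification, score(s_{t+1}) - score(s_t).
    These rewards telescope to score(s_T) - score(s_0), so maximizing
    their sum targets only the final molecule's evaluation score.
    `step_scores` is a hypothetical per-step evaluation sequence."""
    step_scores = np.asarray(step_scores, dtype=float)
    return np.diff(step_scores)

def path_consistency_loss(path_values):
    """Path-consistency (PC) objective originating from A* search:
    the values along one optimal path should be identical, so we
    penalize their squared deviation from the path mean (one plausible
    realization; the paper's exact loss may differ)."""
    v = np.asarray(path_values, dtype=float)
    return float(np.mean((v - v.mean()) ** 2))

# Toy path of evaluation scores for intermediate molecules.
scores = [0.20, 0.35, 0.30, 0.55]
print(immediate_rewards(scores))               # [ 0.15 -0.05  0.25], sums to 0.35
print(path_consistency_loss([0.9, 1.1, 1.0]))  # small but nonzero inconsistency
```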
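The value-guided decoding described above can be sketched as a standard beam search in which the learned value estimator, rather than stochastic sampling, drives the decision-making; `expand`, `value`, and the toy string states below are hypothetical stand-ins for the molecule-modification operator and the trained estimator.

```python
import heapq

def value_guided_beam_search(init_state, expand, value, beam_width=3, steps=5):
    """At every step, expand all states in the beam and keep only the
    `beam_width` candidates with the highest estimated value, yielding
    deterministic, large-scale generation of high-value candidates."""
    beam = [init_state]
    for _ in range(steps):
        candidates = [s for state in beam for s in expand(state)]
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=value)
    return max(beam, key=value)

# Toy example: states are strings, a "modification" appends a character,
# and the "value estimator" rewards alternating patterns.
expand = lambda s: [s + c for c in "ab"]
value = lambda s: sum(1 for x, y in zip(s, s[1:]) if x != y)
print(value_guided_beam_search("a", expand, value))  # -> 'ababab'
```

Keeping several beams alive is what allows the method to return a diverse set of high-scoring molecules at once, instead of the single trajectory a greedy policy would produce.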