Dunn Ian, Koes David Ryan
Dept. of Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260.
ArXiv. 2024 Nov 25:arXiv:2411.16644v1.
Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Flow matching is a recently proposed generative modeling framework that has achieved impressive performance on a variety of tasks, including those involving biomolecular structures. The seminal flow matching framework was developed only for continuous data. However, molecular design tasks require generating discrete data such as atomic elements or sequences of amino acid residues. Several discrete flow matching methods have recently been proposed to address this gap. In this work, we benchmark the performance of existing discrete flow matching methods for 3D small molecule generation and provide explanations of their differing behavior. As a result, we present FlowMol-CTMC, an open-source model that achieves state-of-the-art performance for 3D design with fewer learnable parameters than existing methods. Additionally, we propose the use of metrics that capture molecule quality beyond local chemical valency constraints and towards higher-order structural motifs. These metrics show that even though basic constraints are satisfied, the models tend to produce unusual and potentially problematic functional groups outside of the training data distribution. Code and trained models for reproducing this work are available at https://github.com/dunni3/FlowMol.