Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada.
Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada.
Nat Commun. 2022 Jun 7;13(1):3293. doi: 10.1038/s41467-022-30839-x.
Deep generative models of molecules have grown immensely in popularity. Trained on relevant datasets, these models are used to search through chemical space. The downstream utility of generative models for the inverse design of novel functional compounds depends on their ability to learn a training distribution of molecules. The simplest example is a language model that takes the form of a recurrent neural network and generates molecules using a string representation. Since their initial use, subsequent work has shown that language models are very capable; in particular, recent research has demonstrated their utility in the low-data regime. In this work, we investigate the capacity of simple language models to learn more complex distributions of molecules. For this purpose, we introduce several challenging generative modeling tasks by compiling larger, more complex distributions of molecules, and we evaluate the ability of language models on each task. The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions. Language models can accurately generate distributions of the highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular distributions, and the largest molecules in PubChem. The results highlight the limitations of some of the most popular and recent graph generative models, many of which cannot scale to these molecular distributions.