Language models can learn complex molecular distributions.

Affiliations

Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada.

Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada.

Publication information

Nat Commun. 2022 Jun 7;13(1):3293. doi: 10.1038/s41467-022-30839-x.

DOI:10.1038/s41467-022-30839-x

PMID:35672310

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9174447/

Abstract

Deep generative models of molecules have grown immensely in popularity; trained on relevant datasets, these models are used to search through chemical space. The downstream utility of generative models for the inverse design of novel functional compounds depends on their ability to learn a training distribution of molecules. The simplest example is a language model that takes the form of a recurrent neural network and generates molecules using a string representation. Since their initial use, subsequent work has shown that language models are very capable; in particular, recent research has demonstrated their utility in the low-data regime. In this work, we investigate the capacity of simple language models to learn more complex distributions of molecules. For this purpose, we introduce several challenging generative modeling tasks by compiling larger, more complex distributions of molecules, and we evaluate the ability of language models on each task. The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions. Language models can accurately generate distributions of the highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular distributions, and the largest molecules in PubChem. The results highlight the limitations of some of the most popular and recent graph generative models, many of which cannot scale to these molecular distributions.
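As a rough illustration of what "generating molecules using a string representation" means, the sketch below samples SMILES-like strings one character at a time. It is not the paper's method: a simple bigram count model stands in for the recurrent neural network, the training strings are a made-up toy set, and the names `fit_bigram`, `sample`, `BOS`, and `EOS` are hypothetical. Sampled strings are not guaranteed to be valid SMILES; checking validity would need a cheminformatics toolkit, which is not used here.

```python
import random
from collections import defaultdict

# Hypothetical start/end tokens; multi-character so they cannot clash with SMILES symbols.
BOS, EOS = "<bos>", "<eos>"


def fit_bigram(smiles_list):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in smiles_list:
        seq = [BOS] + list(s) + [EOS]
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def sample(counts, rng, max_len=100):
    """Generate one string autoregressively: each character is drawn
    conditioned on the previous one, until EOS or max_len is reached.
    Assumes the model was fitted on at least one string."""
    out, prev = [], BOS
    for _ in range(max_len):
        followers = counts[prev]
        chars = list(followers)
        weights = [followers[c] for c in chars]
        nxt = rng.choices(chars, weights=weights, k=1)[0]
        if nxt == EOS:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


if __name__ == "__main__":
    # Toy training set (an assumption for illustration, not data from the paper).
    train = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O"]
    model = fit_bigram(train)
    rng = random.Random(0)
    for _ in range(5):
        print(sample(model, rng))
```

A recurrent network replaces the one-character lookup table with a learned hidden state that summarizes the whole prefix. A hidden state of that kind can track long-range constraints, such as matching ring-closure digits and parentheses, that a bigram model cannot capture.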

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c8a/9174447/c88dafa06ada/41467_2022_30839_Fig1_HTML.jpg

Similar articles

Language models can learn complex molecular distributions.

Nat Commun. 2022 Jun 7;13(1):3293. doi: 10.1038/s41467-022-30839-x.

Molecular language models: RNNs or transformer?

Brief Funct Genomics. 2023 Jul 17;22(4):392-400. doi: 10.1093/bfgp/elad012.

Training recurrent neural networks as generative neural networks for molecular structures: how does it impact drug discovery?

Expert Opin Drug Discov. 2022 Oct;17(10):1071-1079. doi: 10.1080/17460441.2022.2134340. Epub 2022 Oct 17.

Multimodal fused deep learning for drug property prediction: Integrating chemical language and molecular graph.

Comput Struct Biotechnol J. 2024 Apr 12;23:1666-1679. doi: 10.1016/j.csbj.2024.04.030. eCollection 2024 Dec.

Comparative Study of Deep Generative Models on Chemical Space Coverage.

J Chem Inf Model. 2021 Jun 28;61(6):2572-2581. doi: 10.1021/acs.jcim.0c01328. Epub 2021 May 20.

Adversarial Threshold Neural Computer for Molecular de Novo Design.

Mol Pharm. 2018 Oct 1;15(10):4386-4397. doi: 10.1021/acs.molpharmaceut.7b01137. Epub 2018 Mar 30.

Network-principled deep generative models for designing drug combinations as graph sets.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i445-i454. doi: 10.1093/bioinformatics/btaa317.

Generative Deep Learning for Targeted Compound Design.

J Chem Inf Model. 2021 Nov 22;61(11):5343-5361. doi: 10.1021/acs.jcim.0c01496. Epub 2021 Oct 26.

De novo drug design as GPT language modeling: large chemistry models with supervised and reinforcement learning.

J Comput Aided Mol Des. 2024 Apr 22;38(1):20. doi: 10.1007/s10822-024-00559-z.

Chemical language modeling with structured state space sequence models.

Nat Commun. 2024 Jul 22;15(1):6176. doi: 10.1038/s41467-024-50469-9.

Cited by

Emulating sensation by bridging neuromorphic computing and multisensory integration.

Patterns (N Y). 2025 Apr 29;6(7):101238. doi: 10.1016/j.patter.2025.101238. eCollection 2025 Jul 11.

Going beyond SMILES enumeration for data augmentation in generative drug discovery.

Digit Discov. 2025 Aug 14. doi: 10.1039/d5dd00028a.

Optimizing drug design by merging generative AI with a physics-based active learning framework.

Commun Chem. 2025 Aug 8;8(1):238. doi: 10.1038/s42004-025-01635-7.

Leveraging tree-transformer VAE with fragment tokenization for high-performance large chemical model generation.

Commun Chem. 2025 Aug 5;8(1):228. doi: 10.1038/s42004-025-01640-w.

Benchmarking 3D Structure-Based Molecule Generators.

J Chem Inf Model. 2025 Aug 11;65(15):8006-8021. doi: 10.1021/acs.jcim.5c01020. Epub 2025 Jul 25.

Generative Deep Learning for de Novo Drug Design: A Chemical Space Odyssey.

J Chem Inf Model. 2025 Jul 28;65(14):7352-7372. doi: 10.1021/acs.jcim.5c00641. Epub 2025 Jul 9.

An open-source family of large encoder-decoder foundation models for chemistry.

Commun Chem. 2025 Jul 1;8(1):193. doi: 10.1038/s42004-025-01585-0.

AI-HOPE: an AI-driven conversational agent for enhanced clinical and genomic data integration in precision medicine research.

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf359.

Multi-objective drug design with a scaffold-aware variational autoencoder.

Chem Sci. 2025 Jun 25. doi: 10.1039/d4sc08736d.

Generative Adversarial Model-Based Optimization via Source Critic Regularization.

Adv Neural Inf Process Syst. 2024;37:44009-44039.
