A systematic review of deep learning chemical language models in recent era.

Author information

Flores-Hernandez Hector, Martinez-Ledesma Emmanuel

Affiliations

Tecnológico de Monterrey, School of Engineering and Sciences, Monterrey, 64710, Nuevo León, México.

Tecnológico de Monterrey, School of Medicine and Health Sciences, Monterrey, 64710, Nuevo León, México.

Publication information

J Cheminform. 2024 Nov 18;16(1):129. doi: 10.1186/s13321-024-00916-y.

DOI:10.1186/s13321-024-00916-y

PMID:39558376

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11571686/

Abstract

Discovering new chemical compounds with specific properties can benefit fields that rely on materials for their development, but the task is costly in both complexity and resources. Since the beginning of the data age, deep learning techniques have revolutionized molecular design by analyzing and learning from representations of molecular data, greatly reducing the resources and time involved. Various deep learning approaches, using a variety of architectures and strategies, have been developed to explore the extensive and discontinuous chemical space and to generate compounds with specific properties. In this study, we present a systematic review that offers a statistical description and comparison of the strategies used to generate molecules through deep learning techniques, based on the metrics proposed in Molecular Sets (MOSES) or GuacaMol. The study included 48 articles retrieved from a query-based search of Scopus and Web of Science and 25 articles retrieved from citation search, yielding a total of 72 retrieved articles, of which 62 correspond to chemical language model approaches to molecule generation and the other 10 correspond to molecular graph representations. Transformers, recurrent neural networks (RNNs), generative adversarial networks (GANs), Structured State Space Sequence (S4) models, and variational autoencoders (VAEs) are the main deep learning architectures used for molecule generation in the set of retrieved articles. In addition, transfer learning, reinforcement learning, and conditional learning are the techniques most often employed for biased model generation and exploration of specific regions of chemical space. Finally, this analysis focuses on the central themes of molecular representation, databases, training dataset size, validity-novelty trade-off, and performance of unbiased and biased chemical language models. These themes were selected for a statistical analysis using graphical representations and statistical tests. The resulting analysis reveals the main challenges, advantages, and opportunities in the field of chemical language models over the past four years.
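The validity-novelty trade-off mentioned above can be illustrated with a minimal sketch of three ratios commonly reported for generated molecule sets: validity, uniqueness, and novelty. This is not the MOSES or GuacaMol implementation. The function names are hypothetical, and `toy_is_valid` is a stand-in assumption that only checks for balanced parentheses. A real pipeline would parse and canonicalize each string with a cheminformatics toolkit before comparing.

```python
from typing import Callable, Dict, Iterable


def toy_is_valid(smiles: str) -> bool:
    """Stand-in validity check (assumption): non-empty, balanced parentheses.

    A real benchmark would parse the string with a chemistry toolkit instead.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool] = toy_is_valid,
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty as simple ratios.

    validity   = valid strings / all generated strings
    uniqueness = distinct valid strings / valid strings
    novelty    = distinct valid strings absent from training / distinct valid strings
    Strings are compared verbatim here; no canonicalization is performed.
    """
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


if __name__ == "__main__":
    sample = ["CCO", "CCO", "C(C", "c1ccccc1", "CC(=O)O"]
    print(generation_metrics(sample, training=["CCO"]))
```

A model that mostly reproduces its training set tends to score high on validity and low on novelty, while a model that strays far from the training distribution tends to do the opposite, which is why these ratios are usually read together.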
