
Transformer-based models for chemical SMILES representation: A comprehensive literature review.

Authors

Mswahili Medard Edmund, Jeong Young-Seob

Affiliation

Chungbuk National University, Department of Computer Engineering, Cheongju, 28644, South Korea.

Publication

Heliyon. 2024 Oct 9;10(20):e39038. doi: 10.1016/j.heliyon.2024.e39038. eCollection 2024 Oct 30.

Abstract

Pre-trained chemical language models (CLMs) have attracted increasing attention in cheminformatics and bioinformatics, inspired by the remarkable success of language models in natural language processing (NLP) tasks such as speech recognition, text analysis, and translation. Furthermore, the vast amount of unlabeled data associated with chemical compounds and molecules has become a crucial research focus, prompting the need for CLMs capable of reasoning over such data. Molecular graphs and molecular descriptors are the predominant approaches for representing molecules in machine learning (ML) property prediction. However, Transformer-based language models (LMs) have recently emerged as de facto powerful tools in deep learning (DL), showing outstanding performance across various NLP downstream tasks, particularly text analysis. Among pre-trained Transformer-based LMs, BERT (and its variants) and GPT (and its variants) have been extensively explored in the chemical informatics domain. Learning tasks in cheminformatics that require handling chemical SMILES data, which encodes intricate relations among elements and atoms, have become increasingly prevalent. Whether the objective is predicting molecular reactions or molecular properties, there is a growing demand for LMs that can learn molecular contextual information from SMILES sequences or strings given as text input. This review provides an overview of the current state of the art of Transformer-based chemical LMs in chemical informatics for de novo design, and analyses their current limitations, challenges, and advantages. Finally, a perspective on future opportunities in this evolving field is provided.
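To make concrete how a SMILES string is treated as "text input" by such LMs, the sketch below splits a SMILES string into atom, bond, and ring-closure tokens using a regular expression. This is an illustrative assumption in the spirit of common CLM preprocessing pipelines, not the exact tokenizer of any specific model covered by the review; the pattern and the function name `tokenize_smiles` are hypothetical.

```python
import re

# Regex covering bracketed atoms ([NH4+]), two-letter halogens (Br, Cl),
# organic-subset atoms, aromatic atoms, bonds, branches, and ring digits.
# Illustrative only; real models may use different vocabularies.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens a language model can embed."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "tokenization lost characters"
    return tokens

# Aspirin as an example molecule.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Token sequences like this are what BERT- or GPT-style CLMs consume during pre-training, letting the self-attention layers learn contextual relations among atoms, bonds, and ring closures.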


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/44e5/11620068/b7c650bde6ee/gr001.jpg
