

Transformer-based models for chemical SMILES representation: A comprehensive literature review.

Author Information

Mswahili Medard Edmund, Jeong Young-Seob

Affiliation

Chungbuk National University, Department of Computer Engineering, Cheongju, 28644, South Korea.

Publication Information

Heliyon. 2024 Oct 9;10(20):e39038. doi: 10.1016/j.heliyon.2024.e39038. eCollection 2024 Oct 30.

DOI: 10.1016/j.heliyon.2024.e39038
PMID: 39640612
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11620068/
Abstract

Pre-trained chemical language models (CLMs) have attracted increasing attention within the domains of cheminformatics and bioinformatics, inspired by their remarkable success in the natural language processing (NLP) domain in tasks such as speech recognition, text analysis, translation, and other language-related objectives. Furthermore, the vast amount of unlabeled data associated with chemical compounds or molecules has emerged as a crucial research focus, prompting the need for CLMs with reasoning capabilities over such data. Molecular graphs and molecular descriptors are the predominant approaches to representing molecules for property prediction in machine learning (ML). However, Transformer-based LMs have recently emerged as de facto powerful tools in deep learning (DL), showcasing outstanding performance across various NLP downstream tasks, particularly in text analysis. Pre-trained Transformer-based LMs such as BERT (and its variants) and GPT (and its variants) have been extensively explored in the chemical informatics domain. Various learning tasks in cheminformatics, such as text analysis, that necessitate handling of chemical SMILES data, which contains intricate relations among elements or atoms, have become increasingly prevalent. Whether the objective is predicting molecular reactions or molecular properties, there is a growing demand for LMs capable of learning molecular contextual information from SMILES sequences or strings given as text inputs. This review provides an overview of the current state of the art of Transformer-based chemical LMs in chemical informatics for de novo design, and analyses current limitations, challenges, and advantages. Finally, a perspective on future opportunities in this evolving field is provided.
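The abstract's central object, a SMILES string, is a linear text encoding of a molecular graph, so before a Transformer LM can process it the string must be split into chemically meaningful tokens (bracket atoms, two-letter elements, bonds, ring closures) rather than raw characters. As a minimal sketch (not code from the paper), the snippet below uses a regex-based tokenization in the style commonly used in the SMILES language-model literature; the exact pattern is an illustrative assumption:

```python
import re

# Illustrative SMILES tokenizer: multi-character tokens (bracket atoms,
# Br/Cl/Si/Se, '@@', '%NN' ring closures) must come before single-character
# alternatives so the regex does not split them apart.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|"
    r"[BCNOSPFIbcnosp]|[=#\-\+\\\/\(\)\.~:@\?>\*\$]|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any character is unmatched."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize: {smiles!r}")
    return tokens

# Example: aspirin — aromatic ring atoms 'c', ring-closure digits, and the
# acetyl/carboxyl groups each become separate tokens.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

The resulting token sequence is what gets mapped to vocabulary IDs and fed to a BERT- or GPT-style model; papers cited in this review (e.g. on atom-in-SMILES tokenization) study how this tokenization choice affects downstream property prediction.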


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/44e5/11620068/b7c650bde6ee/gr001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/44e5/11620068/1493d1e73361/gr002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/44e5/11620068/d93148fd4d5b/gr003.jpg

Similar Articles

1. Transformer-based models for chemical SMILES representation: A comprehensive literature review.
   Heliyon. 2024 Oct 9;10(20):e39038. doi: 10.1016/j.heliyon.2024.e39038. eCollection 2024 Oct 30.
2. Positional embeddings and zero-shot learning using BERT for molecular-property prediction.
   J Cheminform. 2025 Feb 5;17(1):17. doi: 10.1186/s13321-025-00959-9.
3. Can large language models understand molecules?
   BMC Bioinformatics. 2024 Jun 26;25(1):225. doi: 10.1186/s12859-024-05847-x.
4. Molecular Descriptors Property Prediction Using Transformer-Based Approach.
   Int J Mol Sci. 2023 Jul 26;24(15):11948. doi: 10.3390/ijms241511948.
5. Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration.
   Research (Wash D C). 2022 Dec 15;2022:0004. doi: 10.34133/research.0004. eCollection 2022.
6. Application of Transformers in Cheminformatics.
   J Chem Inf Model. 2024 Jun 10;64(11):4392-4409. doi: 10.1021/acs.jcim.3c02070. Epub 2024 May 30.
7. MolGPT: Molecular Generation Using a Transformer-Decoder Model.
   J Chem Inf Model. 2022 May 9;62(9):2064-2076. doi: 10.1021/acs.jcim.1c00600. Epub 2021 Oct 25.
8. A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence.
   J Cheminform. 2024 Jun 19;16(1):71. doi: 10.1186/s13321-024-00848-7.
9. Few-Shot Learning for Clinical Natural Language Processing Using Siamese Neural Networks: Algorithm Development and Validation Study.
   JMIR AI. 2023 May 4;2:e44293. doi: 10.2196/44293.
10. Screening of multi deep learning-based de novo molecular generation models and their application for specific target molecular generation.
    Sci Rep. 2025 Feb 5;15(1):4419. doi: 10.1038/s41598-025-86840-z.

Cited By

1. Machine learning analysis of ARVC informed by sodium channel protein-based interactome networks.
   Front Pharmacol. 2025 Jul 23;16:1611342. doi: 10.3389/fphar.2025.1611342. eCollection 2025.
2. Transformer-based deep learning enables improved B-cell epitope prediction in parasitic pathogens: A proof-of-concept study on Fasciola hepatica.
   PLoS Negl Trop Dis. 2025 Apr 29;19(4):e0012985. doi: 10.1371/journal.pntd.0012985. eCollection 2025 Apr.
3. New Benzothiazole-Monoterpenoid Hybrids as Multifunctional Molecules with Potential Applications in Cosmetics.
   Molecules. 2025 Jan 31;30(3):636. doi: 10.3390/molecules30030636.

References

1. Emerging opportunities of using large language models for translation between drug molecules and indications.
   Sci Rep. 2024 May 10;14(1):10738. doi: 10.1038/s41598-024-61124-0.
2. Molecular Descriptors Property Prediction Using Transformer-Based Approach.
   Int J Mol Sci. 2023 Jul 26;24(15):11948. doi: 10.3390/ijms241511948.
3. Generative Pre-trained Transformer (GPT) based model with relative attention for de novo drug design.
   Comput Biol Chem. 2023 Oct;106:107911. doi: 10.1016/j.compbiolchem.2023.107911. Epub 2023 Jun 29.
4. Transformer-Based Molecular Generative Model for Antiviral Drug Design.
   J Chem Inf Model. 2024 Apr 8;64(7):2733-2745. doi: 10.1021/acs.jcim.3c00536. Epub 2023 Jun 27.
5. Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization.
   J Cheminform. 2023 May 29;15(1):55. doi: 10.1186/s13321-023-00725-9.
6. Computational approaches streamlining drug discovery.
   Nature. 2023 Apr;616(7958):673-685. doi: 10.1038/s41586-023-05905-z. Epub 2023 Apr 26.
7. Chemical language models for de novo drug design: Challenges and opportunities.
   Curr Opin Struct Biol. 2023 Apr;79:102527. doi: 10.1016/j.sbi.2023.102527. Epub 2023 Feb 2.
8. X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis.
   Sci Bull (Beijing). 2022 May 15;67(9):899-902. doi: 10.1016/j.scib.2022.01.029. Epub 2022 Feb 1.
9. PubChem 2023 update.
   Nucleic Acids Res. 2023 Jan 6;51(D1):D1373-D1380. doi: 10.1093/nar/gkac956.
10. SELFIES and the future of molecular string representations.
    Patterns (N Y). 2022 Oct 14;3(10):100588. doi: 10.1016/j.patter.2022.100588.