Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation.

Authors

Han Herim, Yeom Min Sun, Choi Sunghwan

Affiliations

NamuICT R&D Center, NamuICT, 41 Magok Jungang 8-ro, Seoul, 07793, Republic of Korea.

Department of Chemistry, Inha University, 100 Inha-ro, Michuhol-gu, Incheon, 22212, Republic of Korea.

Publication information

Sci Rep. 2025 May 15;15(1):16892. doi: 10.1038/s41598-025-01890-7.

DOI: 10.1038/s41598-025-01890-7
PMID: 40374848
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12081657/

Abstract

The Simplified Molecular Input Line Entry System (SMILES) is one of the most widely adopted molecular representations. However, SMILES notation suffers from limited token diversity and a lack of chemical information within individual tokens. To address these limitations while maintaining its simplicity, we propose a molecular representation method through the hybridization of standard SMILES tokens with Atom-In-SMILES (AIS) tokens, which incorporate local chemical environment information into a single token. This hybrid representation, termed SMI + AIS, combines SMILES and AIS tokens, allowing AIS tokens to differentiate chemical elements based on their chemical context without introducing additional tokens for less frequent elements. Using the SMI + AIS representation, we evaluated its performance by comparing the predefined metric of generated structures in chemical structure generation based on latent space optimization. Compared to standard SMILES, SMI + AIS achieved a 7% improvement in binding affinity and a 6% increase in synthesizability, highlighting its utility in the enhancement of machine learning-based molecular design. Our results demonstrate that the SMI + AIS representation provides a more effective and informative approach to encapsulate chemical context and presents potential for performance enhancement in other machine learning tasks in chemistry.
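
The hybridization idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how AIS tokens are selected, so the frequency threshold (`min_count`), the helper names, and the hand-written AIS strings below are all assumptions made for illustration. In practice the AIS tokens would be derived from the molecular graph with a cheminformatics toolkit; here they are supplied by hand so the example runs on the standard library alone.

```python
import re
from collections import Counter

# Regex commonly used to split a SMILES string into atom-level tokens.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is left over."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable SMILES: {smiles!r}")
    return tokens


def build_ais_vocab(corpus_ais_tokens, min_count):
    """Keep only AIS tokens seen at least `min_count` times (assumed rule)."""
    counts = Counter(
        tok for mol in corpus_ais_tokens for tok in mol if tok is not None
    )
    return {tok for tok, n in counts.items() if n >= min_count}


def hybridize(smiles_tokens, ais_tokens, ais_vocab):
    """Use the AIS token where it is in the vocabulary, else the SMILES token.

    `ais_tokens` is aligned with `smiles_tokens`; positions that are not atoms
    (bonds, ring closures, branches) hold None.
    """
    if len(smiles_tokens) != len(ais_tokens):
        raise ValueError("token lists must be aligned")
    return [
        ais if ais is not None and ais in ais_vocab else smi
        for smi, ais in zip(smiles_tokens, ais_tokens)
    ]


# Hand-written, illustrative AIS-style tokens: [atom;ring flag;neighbors].
ethanol_ais = ["[CH3;!R;C]", "[CH2;!R;CO]", "[OH;!R;C]"]
propane_ais = ["[CH3;!R;C]", "[CH2;!R;CC]", "[CH3;!R;C]"]

vocab = build_ais_vocab([ethanol_ais, propane_ais], min_count=2)
hybrid = hybridize(tokenize_smiles("CCO"), ethanol_ais, vocab)
print(hybrid)
```

With this toy corpus only the methyl-carbon token is frequent enough to enter the vocabulary, so the remaining atoms of ethanol fall back to their plain SMILES tokens. That mirrors the stated goal of adding chemical context to common atoms without enlarging the vocabulary for rare ones.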

Figures

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f227/12081657/dd1ad849ea1d/41598_2025_1890_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f227/12081657/401317408fcc/41598_2025_1890_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f227/12081657/8a744b910c8e/41598_2025_1890_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f227/12081657/b27114123fe6/41598_2025_1890_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f227/12081657/3ad6b6a13988/41598_2025_1890_Fig5_HTML.jpg
Fig. 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f227/12081657/fc22641c5afe/41598_2025_1890_Fig6_HTML.jpg
