t-SMILES: a fragment-based molecular representation framework for de novo ligand design.

Author information

Wu Juan-Ni, Wang Tong, Chen Yue, Tang Li-Juan, Wu Hai-Long, Yu Ru-Qin

Affiliation

State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, PR China.

Publication information

Nat Commun. 2024 Jun 11;15(1):4993. doi: 10.1038/s41467-024-49388-6.

DOI:10.1038/s41467-024-49388-6

PMID:38862578

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11167009/

Abstract

Effective representation of molecules is a crucial factor affecting the performance of artificial intelligence models. This study introduces a flexible, fragment-based, multiscale molecular representation framework called t-SMILES (tree-based SMILES) with three coding algorithms: TSSA (t-SMILES with shared atom), TSDY (t-SMILES with dummy atom but without ID) and TSID (t-SMILES with ID and dummy atom). The framework describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph. Systematic evaluations using the JTVAE, BRICS, MMPA, and Scaffold fragmentation schemes show that it is feasible to construct a multi-code molecular description system in which the various descriptions complement each other and enhance overall performance. In addition, on labeled low-resource datasets, t-SMILES can avoid overfitting and achieve higher novelty scores while maintaining reasonable similarity, regardless of whether the model is original, data-augmented, or pre-trained and then fine-tuned. Furthermore, it significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks. It also surpasses state-of-the-art fragment-, graph- and SMILES-based approaches on ChEMBL, ZINC, and QM9.
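The tree-to-string step described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: it assumes the molecule has already been fragmented into a tree of fragment SMILES, converts that tree to a binary tree with the standard left-child/right-sibling transform, pads missing children so the tree is full, and reads it out breadth-first. The class names, the helper functions, the placeholder symbol `&`, the separator `^`, and the example fragments are all illustrative choices for this sketch, and the attachment-point handling that distinguishes TSSA, TSDY and TSID is omitted.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

EMPTY = "&"  # placeholder for a missing child (illustrative choice)
SEP = "^"    # separator between tokens (illustrative choice)


@dataclass
class FragNode:
    """Node of a general fragment tree: one fragment, any number of children."""
    smiles: str
    children: List["FragNode"] = field(default_factory=list)


@dataclass
class BinNode:
    """Node of the binary tree derived from the fragment tree."""
    smiles: str
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None


def to_binary(node: FragNode, siblings: Sequence[FragNode] = ()) -> BinNode:
    """Left-child/right-sibling transform: the first child becomes the left
    child, the next sibling becomes the right child."""
    b = BinNode(node.smiles)
    if node.children:
        b.left = to_binary(node.children[0], node.children[1:])
    if siblings:
        b.right = to_binary(siblings[0], siblings[1:])
    return b


def bfs_encode(root: BinNode) -> str:
    """Breadth-first traversal. Every real node contributes both children,
    with EMPTY standing in for a missing one, so the traversed tree is full
    (each node has zero or two children)."""
    tokens: List[str] = []
    queue: deque = deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            tokens.append(EMPTY)  # padded leaf, no children to enqueue
            continue
        tokens.append(node.smiles)
        queue.append(node.left)
        queue.append(node.right)
    return SEP.join(tokens)


# Hypothetical fragment tree: a benzene ring with two attached fragments.
tree = FragNode("c1ccccc1", [FragNode("C(=O)O"), FragNode("N")])
encoded = bfs_encode(to_binary(tree))
print(encoded)
```

With three fragments the output contains seven tokens: three fragments plus four placeholders, as expected for a full binary tree with three internal nodes. In practice the fragment tree would come from a fragmentation scheme such as those named in the abstract rather than being written by hand.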

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cc9/11167009/5b2a39114239/41467_2024_49388_Fig1_HTML.jpg

Similar articles

Multi-objective de novo drug design with conditional graph generative model.

J Cheminform. 2018 Jul 24;10(1):33. doi: 10.1186/s13321-018-0287-6.

SMILES-based deep generative scaffold decorator for de-novo drug design.

J Cheminform. 2020 May 29;12(1):38. doi: 10.1186/s13321-020-00441-8.

NoiseMol: A noise-robusted data augmentation via perturbing noise for molecular property prediction.

J Mol Graph Model. 2023 Jun;121:108454. doi: 10.1016/j.jmgm.2023.108454. Epub 2023 Mar 15.

Randomized SMILES strings improve the quality of molecular generative models.

J Cheminform. 2019 Nov 21;11(1):71. doi: 10.1186/s13321-019-0393-0.

UnCorrupt SMILES: a novel approach to de novo design.

J Cheminform. 2023 Feb 14;15(1):22. doi: 10.1186/s13321-023-00696-x.

De Novo Molecule Design by Translating from Reduced Graphs to SMILES.

J Chem Inf Model. 2019 Mar 25;59(3):1136-1146. doi: 10.1021/acs.jcim.8b00626. Epub 2018 Dec 21.

Masked graph modeling for molecule generation.

Nat Commun. 2021 May 26;12(1):3156. doi: 10.1038/s41467-021-23415-2.

MultiGran-SMILES: multi-granularity SMILES learning for molecular property prediction.

Bioinformatics. 2022 Sep 30;38(19):4573-4580. doi: 10.1093/bioinformatics/btac550.

Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules.

Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab327.

Cited by

Enhancing deep chemical reaction prediction with advanced chirality and fragment representation.

Chem Commun (Camb). 2025 Sep 11. doi: 10.1039/d5cc02641e.

Representation of Molecules by Sequences of Instructions.

J Chem Inf Model. 2025 Aug 11;65(15):7936-7955. doi: 10.1021/acs.jcim.5c00354. Epub 2025 Jul 28.

Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation.

Sci Rep. 2025 May 15;15(1):16892. doi: 10.1038/s41598-025-01890-7.

CGsmiles: A Versatile Line Notation for Molecular Representations across Multiple Resolutions.

J Chem Inf Model. 2025 Apr 14;65(7):3405-3419. doi: 10.1021/acs.jcim.5c00064. Epub 2025 Mar 24.

fragSMILES as a chemical string notation for advanced fragment and chirality representation.

Commun Chem. 2025 Jan 29;8(1):26. doi: 10.1038/s42004-025-01423-3.

A hitchhiker's guide to deep chemical language processing for bioactivity prediction.

Digit Discov. 2024 Dec 16;4(2):316-325. doi: 10.1039/d4dd00311j. eCollection 2025 Feb 12.
