Token-Mol 1.0: tokenized drug design with large language models.

Authors

Wang Jike, Qin Rui, Wang Mingyang, Fang Meijing, Zhang Yangyang, Zhu Yuchen, Su Qun, Gou Qiaolin, Shen Chao, Zhang Odin, Wu Zhenxing, Jiang Dejun, Zhang Xujun, Zhao Huifeng, Ge Jingxuan, Wu Zhourui, Kang Yu, Hsieh Chang-Yu, Hou Tingjun

Affiliations

College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China.

Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA.

Publication information

Nat Commun. 2025 May 13;16(1):4416. doi: 10.1038/s41467-025-59628-y.

DOI:10.1038/s41467-025-59628-y
PMID:40360500
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12075800/
Abstract

The integration of large language models (LLMs) into drug design is gaining momentum; however, existing approaches often struggle to effectively incorporate three-dimensional molecular structures. Here, we present Token-Mol, a token-only 3D drug design model that encodes both 2D and 3D structural information, along with molecular properties, into discrete tokens. Built on a transformer decoder and trained with causal masking, Token-Mol introduces a Gaussian cross-entropy loss function tailored for regression tasks, enabling superior performance across multiple downstream applications. The model surpasses existing methods, improving molecular conformation generation by over 10% and 20% across two datasets, while outperforming token-only models by 30% in property prediction. In pocket-based molecular generation, it enhances drug-likeness and synthetic accessibility by approximately 11% and 14%, respectively. Notably, Token-Mol operates 35 times faster than expert diffusion models. In real-world validation, it improves success rates and, when combined with reinforcement learning, further optimizes affinity and drug-likeness, advancing AI-driven drug discovery.
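The abstract names a Gaussian cross-entropy loss for regression but does not give its formula. The sketch below is one plausible reading and not the authors' implementation: it assumes a numeric value has been discretized into bins (tokens), and it replaces the one-hot target with a Gaussian-shaped distribution centered on the true bin. The function names, the bin count, and the `sigma` parameter are all hypothetical choices made for illustration.

```python
import math

def gaussian_soft_targets(true_bin, num_bins, sigma=1.0):
    """Assumed target: probability mass spread around the true bin by a Gaussian kernel."""
    weights = [math.exp(-((i - true_bin) ** 2) / (2.0 * sigma ** 2))
               for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def gaussian_cross_entropy(logits, true_bin, sigma=1.0):
    """Cross-entropy between the Gaussian soft targets and the predicted distribution."""
    targets = gaussian_soft_targets(true_bin, len(logits), sigma)
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))

def peaked_logits(center, num_bins, scale=2.0):
    """Hypothetical model output: logits that peak at `center` and fall off linearly."""
    return [-scale * abs(i - center) for i in range(num_bins)]

# A prediction one bin away from the truth should cost less than one five bins away.
near = gaussian_cross_entropy(peaked_logits(5, 10), true_bin=4)
far = gaussian_cross_entropy(peaked_logits(9, 10), true_bin=4)
print(near < far)
```

Under this assumed formulation, predictions that land close to the true value are penalized less than distant ones, which is the ordering a regression target needs and which a plain one-hot cross-entropy over unrelated token classes does not encode.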

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a9f/12075800/0b866b23724f/41467_2025_59628_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a9f/12075800/a7738ba25174/41467_2025_59628_Fig1_HTML.jpg
