IUPAC-GPT: an IUPAC-based large-scale molecular pre-trained model for property prediction and molecule generation.

Authors

Mao Jiashun, Sui Tang, Cho Kwang-Hwi, No Kyoung Tai, Wang Jianmin, Shan Dongjing

Affiliations

School of Medical Information and Engineering, Southwest Medical University, Luzhou, 610199, China.

Department of Integrative Biotechnology, Yonsei University, Incheon, 21983, Korea.

Publication information

Mol Divers. 2025 Jul 3. doi: 10.1007/s11030-025-11280-w.

PMID: 40608231

Abstract

The International Union of Pure and Applied Chemistry (IUPAC) nomenclature is a universally recognized standard for naming chemical compounds and is a human-friendly, substructure-level molecular language. The simplified molecular input line entry system (SMILES) string is currently the most popular molecular representation and is a computer-friendly, atom-level molecular language. Given the readability of IUPAC names and the advantages of SMILES strings, it is worth investigating how these two molecular languages differ in molecular generation and in regression/classification tasks. We therefore developed a chemical language model named IUPAC-GPT. Besides molecular generation, we also fine-tuned for regression/classification tasks by freezing the IUPAC-GPT model parameters and attaching trainable lightweight networks. The results indicate that pre-trained IUPAC-GPT captures general knowledge that transfers effectively to downstream tasks such as molecular generation, binary classification, and property regression. Furthermore, under the same configuration, IUPAC-GPT outperformed the smilesGPT model on some property prediction tasks. Overall, transformer-like language models pretrained on IUPAC corpora emerge as promising alternatives, offering improved interpretability and semantic abstraction (chemical groups and modifications) compared with models pretrained on SMILES corpora.
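The contrast the abstract draws between an atom-level language (SMILES) and a substructure-level language (IUPAC names) can be made concrete with a small tokenization sketch. This is not the tokenizer used by IUPAC-GPT, which the abstract does not describe. The SMILES regular expression is a pattern commonly used in the cheminformatics literature, and `IUPAC_FRAGMENTS` is a tiny hypothetical vocabulary chosen only so that the two example names can be split into chemically meaningful pieces.

```python
import re

# Atom-level SMILES tokens: bracket atoms, two-letter halogens, single atoms,
# bonds, branches and ring-closure digits each become one token.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

# Hypothetical, deliberately tiny IUPAC fragment vocabulary (an assumption
# made for this illustration, not the vocabulary of any published model).
IUPAC_FRAGMENTS = sorted(
    ["meth", "eth", "prop", "but", "an", "en", "yn", "ol", "oic acid",
     "amine", "yl", "benzene", "chloro", "bromo", "hydroxy", "-", ",", "(", ")", " "],
    key=len,
    reverse=True,  # try the longest fragments first
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def tokenize_iupac(name: str) -> list[str]:
    """Greedy longest-match split of an IUPAC name into fragment tokens."""
    tokens, i = [], 0
    while i < len(name):
        for fragment in IUPAC_FRAGMENTS:
            if name.startswith(fragment, i):
                tokens.append(fragment)
                i += len(fragment)
                break
        else:
            if name[i].isdigit():  # locants such as the "2" in "2-chloro..."
                tokens.append(name[i])
                i += 1
            else:
                raise ValueError(f"no fragment matches at position {i} in {name!r}")
    return tokens


# The same two molecules written in both languages.
for smiles, name in [("CC(=O)O", "ethanoic acid"), ("OCCCl", "2-chloroethanol")]:
    print(tokenize_smiles(smiles), tokenize_iupac(name))
```

In this toy setting, ethanoic acid becomes seven atom and bond tokens in SMILES but three fragment tokens (`eth`, `an`, `oic acid`) in its IUPAC name, where each token names a chain length, a saturation level, or a functional group. That is the sense in which the abstract calls IUPAC names a substructure-level language.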

Cited by

1
Diffusion-based generative drug-like molecular editing with chemical natural language.
J Pharm Anal. 2025 Jun;15(6):101137. doi: 10.1016/j.jpha.2024.101137. Epub 2024 Feb 11.
