IUPAC-GPT: an IUPAC-based large-scale molecular pre-trained model for property prediction and molecule generation.

Authors

Mao Jiashun, Sui Tang, Cho Kwang-Hwi, No Kyoung Tai, Wang Jianmin, Shan Dongjing

Affiliations

School of Medical Information and Engineering, Southwest Medical University, Luzhou, 610199, China.

Department of Integrative Biotechnology, Yonsei University, Incheon, 21983, Korea.

Publication information

Mol Divers. 2025 Jul 3. doi: 10.1007/s11030-025-11280-w.

Abstract

The International Union of Pure and Applied Chemistry (IUPAC) nomenclature is a universally recognized standard for naming chemical compounds and is a human-friendly, substructure-level molecular language. The simplified molecular input line entry system (SMILES) string is currently the most popular molecular representation and is a computer-friendly, atomic-level molecular language. Given the readability of IUPAC names and the advantages of SMILES strings, it is worthwhile to investigate how these two molecular languages differ in molecular generation and in regression/classification tasks. We therefore developed a chemical language model named IUPAC-GPT. Beyond molecular generation, we fine-tune for regression/classification tasks by freezing the IUPAC-GPT model parameters and attaching lightweight trainable networks. The results indicate that the pre-trained IUPAC-GPT captures general knowledge that transfers effectively to downstream tasks such as molecular generation, binary classification, and property regression. Furthermore, under the same configuration, IUPAC-GPT outperformed the smilesGPT model on some property prediction tasks. Overall, transformer-like language models pre-trained on IUPAC corpora emerge as promising alternatives, offering better interpretability and semantic abstraction (chemical groups and modifications) than models pre-trained on SMILES corpora.
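To make the contrast between an atomic-level and a substructure-level molecular language concrete, the following minimal sketch tokenizes the same molecule (2-chloroethanol) in both notations. It is an illustration only and not the tokenizer used by IUPAC-GPT or smilesGPT: the regular expression, the toy fragment vocabulary `IUPAC_FRAGMENTS`, and the function names are all assumptions made for this example.

```python
import re

# Atom-level SMILES tokens (illustrative pattern, not from the paper):
# bracket atoms, two-letter halogens, ring-closure labels such as %12,
# then any remaining single character (atom, bond, branch, ring digit).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Toy fragment vocabulary, chosen only for this illustration.
IUPAC_FRAGMENTS = [
    "chloro", "hydroxy", "methyl", "meth", "eth", "prop",
    "an", "en", "ol", "oic", "acid",
]


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    return SMILES_TOKEN.findall(smiles)


def tokenize_iupac(name: str) -> list[str]:
    """Split an IUPAC name into known fragments, locants, and punctuation."""
    # Longest fragments first so that "methyl" is tried before "meth".
    fragments = sorted(IUPAC_FRAGMENTS, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, fragments)) + r"|\d+|\S")
    return pattern.findall(name)


if __name__ == "__main__":
    print(tokenize_smiles("ClCCO"))           # ['Cl', 'C', 'C', 'O']
    print(tokenize_iupac("2-chloroethanol"))  # ['2', '-', 'chloro', 'eth', 'an', 'ol']
```

In this sketch each SMILES token corresponds to a single atom or bond symbol, whereas the IUPAC tokens name chemical groups and modifications (here a chloro substituent, a two-carbon chain, and an alcohol suffix), which is the kind of semantic abstraction the abstract refers to.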

