An open-source family of large encoder-decoder foundation models for chemistry.

Authors

Soares Eduardo, Vital Brazil Emilio, Shirasuna Victor, Zubarev Dmitry, Cerqueira Renato, Schmidt Kristin

Affiliations

IBM Research, Rio de Janeiro, Brazil.

IBM Research, Almaden, CA, USA.

Publication information

Commun Chem. 2025 Jul 1;8(1):193. doi: 10.1038/s42004-025-01585-0.

DOI:10.1038/s42004-025-01585-0
PMID:40593316
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12216393/

Abstract

The use of foundation models has extended from natural language processing to molecular modeling. In this context, large-scale pre-training strategies have been applied to chemical language models to enable representation learning across diverse tasks. Here we introduce a family of encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million molecular sequences from PubChem. These models support a range of applications, including property estimation and reaction outcome prediction. We evaluate two model variants across several benchmark datasets and show that they match or exceed existing approaches. We also assess the structure of the learned representations and find that the embedding space supports few-shot learning and separates molecules based on chemically relevant features. This structure appears to result from the decoder-based reconstruction objective used during pre-training. These findings suggest that the proposed models can serve as general-purpose tools for molecular analysis and reasoning with minimal supervision.

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/37e1/12216393/c9ec51cecc38/42004_2025_1585_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/37e1/12216393/87f002ce807c/42004_2025_1585_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/37e1/12216393/352b80599d49/42004_2025_1585_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/37e1/12216393/4b5df9413899/42004_2025_1585_Fig4_HTML.jpg
