Soares Eduardo, Vital Brazil Emilio, Shirasuna Victor, Zubarev Dmitry, Cerqueira Renato, Schmidt Kristin
IBM Research, Rio de Janeiro, Brazil.
IBM Research, Almaden, CA, USA.
Commun Chem. 2025 Jul 1;8(1):193. doi: 10.1038/s42004-025-01585-0.
The use of foundation models has extended from natural language processing to molecular modeling. In this context, large-scale pre-training strategies have been applied to chemical language models to enable representation learning across diverse tasks. Here we introduce a family of encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million molecular sequences from PubChem. These models support a range of applications, including property estimation and reaction outcome prediction. We evaluate two model variants across several benchmark datasets and show that they match or exceed existing approaches. We also assess the structure of the learned representations and find that the embedding space supports few-shot learning and separates molecules based on chemically relevant features. This structure appears to result from the decoder-based reconstruction objective used during pre-training. These findings suggest that the proposed models can serve as general-purpose tools for molecular analysis and reasoning with minimal supervision.
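To make the few-shot claim concrete, the sketch below fits a lightweight probe on frozen molecular embeddings, which is the usual way such an embedding space is used with minimal supervision. This is an illustrative sketch only: the `encode` helper is a stand-in (random features keyed off the string) for the paper's pre-trained encoder, and the SMILES strings and labels are placeholders, not data or an API from the paper.

```python
# Minimal sketch: few-shot property prediction on top of frozen embeddings
# from a pre-trained chemical language model. `encode` is a stand-in for the
# real encoder; SMILES and labels below are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(smiles_list, dim=64):
    # Stand-in for the real encoder: pseudo-embeddings keyed off string bytes.
    return np.stack([
        np.random.default_rng(sum(s.encode())).standard_normal(dim)
        for s in smiles_list
    ])

# Few-shot "support" set: a handful of labeled molecules.
support_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]
support_labels = [0, 1, 0, 1]            # illustrative binary property labels
query_smiles   = ["CCCO", "c1ccccc1C"]   # unlabeled molecules to predict

# Fit a lightweight probe on the frozen embeddings, then predict for new molecules.
X_support = encode(support_smiles)
probe = LogisticRegression(max_iter=1000).fit(X_support, support_labels)
print(probe.predict(encode(query_smiles)))
```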