RSGPT: a generative transformer model for retrosynthesis planning pre-trained on ten billion datapoints.

Author information

Deng Yafeng, Zhao Xinda, Sun Hanyu, Chen Yu, Wang Xiaorui, Xue Xi, Li Liangning, Song Jianfei, Hsieh Chang-Yu, Hou Tingjun, Pan Xiandao, Alomar Taghrid Saad, Ji Xiangyang, Wang Xiaojian

Affiliations

Department of Automation, Tsinghua University, Beijing, China.

Hangzhou Carbonsilicon AI Technology Co., Ltd, Hangzhou, China.

Publication information

Nat Commun. 2025 Jul 31;16(1):7012. doi: 10.1038/s41467-025-62308-6.

DOI:10.1038/s41467-025-62308-6

PMID:40744941

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12314115/

Abstract

Retrosynthesis planning is a crucial task in organic synthesis, and deep-learning methods have enhanced and accelerated this process. With the emergence of large language models, the demand for data is rapidly increasing. However, available retrosynthesis data are limited to only millions of datapoints. Therefore, we pioneer the use of a template-based algorithm to generate chemical reaction data, producing over 10 billion reaction datapoints. A generative pretrained transformer model is subsequently developed for template-free retrosynthesis planning by pre-training on the 10 billion generated datapoints. Inspired by the strategies of large language models, we introduce reinforcement learning to capture the relationships among products, reactants, and templates more accurately. Experiments demonstrate that our model achieves state-of-the-art performance on the benchmark, with a Top-1 accuracy of 63.4%, substantially outperforming previous models.
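For readers unfamiliar with the Top-1 figure quoted above, the following is a minimal sketch of how a Top-k exact-match accuracy is commonly computed for single-step retrosynthesis. It is not the authors' evaluation code: the function names `normalize` and `top_k_accuracy` and the toy SMILES strings are hypothetical, and the sketch assumes the SMILES strings are already canonical, so it only makes the comparison insensitive to reactant order. Real evaluations typically canonicalize each SMILES with a cheminformatics toolkit first.

```python
from typing import List, Sequence, Tuple


def normalize(reactants: str) -> Tuple[str, ...]:
    """Order-insensitive key for a dot-separated reactant SMILES string.

    Assumption: each component is already a canonical SMILES, so sorting
    the components is enough to compare two reactant sets.
    """
    parts = [p.strip() for p in reactants.split(".") if p.strip()]
    return tuple(sorted(parts))


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of products whose reference reactants appear in the first k ranked predictions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for ranked, truth in zip(predictions, references):
        target = normalize(truth)
        if any(normalize(candidate) == target for candidate in ranked[:k]):
            hits += 1
    return hits / len(references)


# Toy data (illustrative only, not taken from the paper's benchmark).
references = [
    "CC(=O)O.CCO",
    "c1ccccc1Br.OB(O)c1ccccc1",
]
predictions = [
    ["CCO.CC(=O)O", "CC(=O)Cl.CCO"],                          # correct at rank 1
    ["c1ccccc1I.OB(O)c1ccccc1", "OB(O)c1ccccc1.c1ccccc1Br"],  # correct at rank 2
]

top1 = top_k_accuracy(predictions, references, k=1)
top2 = top_k_accuracy(predictions, references, k=2)
```

Under this definition, a Top-1 accuracy of 63.4% would mean that the highest-ranked prediction matches the recorded reactants for 63.4% of the test products.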
