Yang Yifei, Shi Runhan, Li Zuchao, Jiang Shu, Lu Bao-Liang, Zhao Qibin, Yang Yang, Zhao Hai
School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China.
Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai 200240, China.
Research (Wash D C). 2025 Sep 10;8:0827. doi: 10.34133/research.0827. eCollection 2025.
Large language models (LLMs) have showcased remarkable capabilities in AI for Science, and chemistry has benefited greatly from the advancement of AI tools. With a strong capacity for learning sequential data such as natural language, LLMs offer immense potential. Despite this promise, the application of LLMs in chemistry remains limited, with few models designed specifically for chemical data and tasks. We therefore propose leveraging LLMs to comprehensively model both chemical sequences and natural-language sequences, aiming to tackle diverse chemical tasks. We introduce BatGPT-Chem, a large-scale general foundation model with 15 billion parameters tailored for chemical engineering. Built on a corpus of over 100 million chemical instances, BatGPT-Chem specializes in 5 core tasks: retrosynthesis prediction, molecule design, molecule description, product inference, and yield prediction. BatGPT-Chem comprehensively models the information flow between chemical language and natural language, enabling full-spectrum prediction across chemical tasks. It is one of the largest bilingual chemistry-specific LLMs, supporting both English and Chinese input and output. BatGPT-Chem is also the first automated retrosynthesis tool that explicitly predicts reaction conditions, a critical but often overlooked aspect in previous models. In rigorous zero-shot evaluations, BatGPT-Chem achieves state-of-the-art performance, surpassing both existing chemical LLMs and general-purpose models in accuracy and validity across a diverse range of tasks. Notably, it shows superior ability in predicting both reactants and reaction conditions, as well as strong generalization in low-data settings. These results suggest that BatGPT-Chem is among the most advanced and practical chemical LLMs, with strong potential to support real-world applications in synthesis planning, drug discovery, and materials design.
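As a rough illustration of how a prompt-driven chemistry LLM of this kind might be queried, the sketch below loads a causal language model with the Hugging Face transformers library and asks for a retrosynthesis together with reaction conditions for a target molecule given as SMILES. The checkpoint identifier, prompt wording, and generation settings are illustrative assumptions, not the authors' published interface.

```python
# Minimal sketch of querying a chemistry LLM for retrosynthesis with conditions.
# The checkpoint name "BatGPT-Chem" and the prompt format below are assumptions
# made for illustration only; consult the model's official release for the
# actual identifier and prompting scheme.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "BatGPT-Chem"  # hypothetical checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Ask for reactants and explicit reaction conditions for a target product.
prompt = (
    "Propose a retrosynthesis for the product below and state the reaction "
    "conditions (reagents, solvent, temperature).\n"
    "Product SMILES: CC(=O)Oc1ccccc1C(=O)O\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

In this sketch the prompt is plain natural language plus a SMILES string, mirroring the paper's framing of chemistry tasks as bilingual sequence-to-sequence generation; a Chinese-language prompt would be handled the same way by a bilingual checkpoint.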