Zhang Wei, Wang Qinggong, Kong Xiangtai, Xiong Jiacheng, Ni Shengkun, Cao Duanhua, Niu Buying, Chen Mingan, Li Yameng, Zhang Runze, Wang Yitian, Zhang Lehan, Li Xutong, Xiong Zhaoping, Shi Qian, Huang Ziming, Fu Zunyun, Zheng Mingyue
Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences 555 Zuchongzhi Road Shanghai 201203 China
University of Chinese Academy of Sciences No. 19A Yuquan Road Beijing 100049 China.
Chem Sci. 2024 Jun 7;15(27):10600-10611. doi: 10.1039/d4sc00924j. eCollection 2024 Jul 10.
Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists, yet it remains extremely challenging due to the complexity of chemical language and the scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided ChatGPT (GPT-3.5-turbo) and GPT-4 with prompt engineering, and fine-tuned GPT-3.5-turbo as well as other open-source LLMs such as Mistral, Llama3, Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT models excelled in all tasks, achieving exact accuracy levels ranging from 69% to 95% with minimal annotated data. They even outperformed task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Notably, fine-tuned Mistral and Llama3 showed competitive performance. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.
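The abstract's fine-tuning approach relies on pairing annotated chemistry sentences with structured target outputs. The sketch below shows how such supervision might be serialized in the chat-style JSONL format used for OpenAI fine-tuning; the system instruction, example sentences, and label format are illustrative assumptions, not the paper's actual training data.

```python
import json

# Hypothetical instruction for the compound entity recognition task.
SYSTEM = "Extract every chemical compound mentioned in the sentence."

# Toy annotated examples: (sentence, list of compound entities).
annotated = [
    ("The mixture was quenched with NaHCO3 and extracted with ethyl acetate.",
     ["NaHCO3", "ethyl acetate"]),
    ("Treatment with LiAlH4 in THF gave the primary alcohol.",
     ["LiAlH4", "THF"]),
]

def to_record(sentence, compounds):
    """One fine-tuning example: sentence in, JSON list of compounds out."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": sentence},
            {"role": "assistant", "content": json.dumps(compounds)},
        ]
    }

# One JSON object per line, as expected by chat-format fine-tuning.
jsonl = "\n".join(json.dumps(to_record(s, c)) for s, c in annotated)
print(jsonl)
```

The assistant turn holds the structured answer, so exact-match accuracy (as reported in the abstract) can be computed by parsing the model's output and comparing it to the gold list.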