

Fine-tuning large language models for chemical text mining.

Authors

Zhang Wei, Wang Qinggong, Kong Xiangtai, Xiong Jiacheng, Ni Shengkun, Cao Duanhua, Niu Buying, Chen Mingan, Li Yameng, Zhang Runze, Wang Yitian, Zhang Lehan, Li Xutong, Xiong Zhaoping, Shi Qian, Huang Ziming, Fu Zunyun, Zheng Mingyue

Affiliations

Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.

University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China.

Publication

Chem Sci. 2024 Jun 7;15(27):10600-10611. doi: 10.1039/d4sc00924j. eCollection 2024 Jul 10.

DOI: 10.1039/d4sc00924j
PMID: 38994403
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11234886/
Abstract

Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task is still considered to be extremely challenging due to the complexity of the chemical language and scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided ChatGPT (GPT-3.5-turbo) and GPT-4 with prompt engineering and fine-tuned GPT-3.5-turbo as well as other open-source LLMs such as Mistral, Llama3, Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT models excelled in all tasks. They achieved exact accuracy levels ranging from 69% to 95% on these tasks with minimal annotated data. They even outperformed those task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Notably, fine-tuned Mistral and Llama3 show competitive abilities. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.
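The abstract describes casting each mining task as supervised fine-tuning on a small annotated set, with the model trained to emit structured output for a chemical paragraph. As a rough illustration only (the prompt wording and record layout below are assumptions in the style of OpenAI chat-format fine-tuning JSONL, not the authors' actual training data), one compound-entity-recognition example could be serialized like this:

```python
import json

# Illustrative system instruction; the paper's real prompts are not reproduced here.
SYSTEM_PROMPT = (
    "Extract every chemical compound mentioned in the paragraph "
    "and return them as a JSON list of strings."
)

def make_record(paragraph, compounds):
    """Serialize one annotated paragraph as a chat-format JSONL fine-tuning line."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": paragraph},
            # The target output the model is fine-tuned to reproduce verbatim.
            {"role": "assistant", "content": json.dumps(compounds)},
        ]
    }
    return json.dumps(record, ensure_ascii=False)

line = make_record(
    "The mixture was quenched with saturated NaHCO3 and extracted "
    "with ethyl acetate.",
    ["NaHCO3", "ethyl acetate"],
)
parsed = json.loads(line)
```

Framing the target as a single serialized string also clarifies the paper's "exact accuracy" metric: a prediction counts as correct only when the whole generated structure matches the annotation.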


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d284/11234886/298bbc89e10b/d4sc00924j-f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d284/11234886/f1f79c63e670/d4sc00924j-f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d284/11234886/fca237ee6dd2/d4sc00924j-f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d284/11234886/baf03f866821/d4sc00924j-f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d284/11234886/df3fc3384b8d/d4sc00924j-f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d284/11234886/ad21a5022bb9/d4sc00924j-f6.jpg

Similar articles

1
Fine-tuning large language models for chemical text mining.
Chem Sci. 2024 Jun 7;15(27):10600-10611. doi: 10.1039/d4sc00924j. eCollection 2024 Jul 10.
2
Automated Extraction of Patient-Centered Outcomes After Breast Cancer Treatment: An Open-Source Large Language Model-Based Toolkit.
JCO Clin Cancer Inform. 2024 Aug;8:e2300258. doi: 10.1200/CCI.23.00258.
3
Improving large language models for clinical named entity recognition via prompt engineering.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1812-1820. doi: 10.1093/jamia/ocad259.
4
Relation extraction using large language models: a case study on acupuncture point locations.
J Am Med Inform Assoc. 2024 Nov 1;31(11):2622-2631. doi: 10.1093/jamia/ocae233.
5
CACER: Clinical concept Annotations for Cancer Events and Relations.
J Am Med Inform Assoc. 2024 Nov 1;31(11):2583-2594. doi: 10.1093/jamia/ocae231.
6
Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1865-1874. doi: 10.1093/jamia/ocae037.
7
A comprehensive evaluation of large language models on benchmark biomedical text processing tasks.
Comput Biol Med. 2024 Mar;171:108189. doi: 10.1016/j.compbiomed.2024.108189. Epub 2024 Feb 20.
8
Zero-Shot LLMs for Named Entity Recognition: Targeting Cardiac Function Indicators in German Clinical Texts.
Stud Health Technol Inform. 2024 Aug 30;317:228-234. doi: 10.3233/SHTI240861.
9
BioInstruct: instruction tuning of large language models for biomedical natural language processing.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1821-1832. doi: 10.1093/jamia/ocae122.
10
Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.
JMIR AI. 2024 May 16;3:e52095. doi: 10.2196/52095.

Cited by

1
Synergizing a knowledge graph and large language model for relay catalysis pathway recommendation.
Natl Sci Rev. 2025 Jul 14;12(8):nwaf271. doi: 10.1093/nsr/nwaf271. eCollection 2025 Aug.
2
Advancing medical question answering with a knowledge embedding transformer.
PLoS One. 2025 Aug 18;20(8):e0329606. doi: 10.1371/journal.pone.0329606. eCollection 2025.
3
Artificial Intelligence Paradigms for Next-Generation Metal-Organic Framework Research.
J Am Chem Soc. 2025 Jul 9;147(27):23367-23380. doi: 10.1021/jacs.5c08214. Epub 2025 Jun 24.
4
NMRExtractor: leveraging large language models to construct an experimental NMR database from open-source scientific publications.
Chem Sci. 2025 May 28. doi: 10.1039/d4sc08802f.
5
Cross-disciplinary perspectives on the potential for artificial intelligence across chemistry.
Chem Soc Rev. 2025 Apr 25. doi: 10.1039/d5cs00146c.
6
Augmented and Programmatically Optimized LLM Prompts Reduce Chemical Hallucinations.
J Chem Inf Model. 2025 May 12;65(9):4274-4280. doi: 10.1021/acs.jcim.4c02322. Epub 2025 Apr 22.
7
Large Language Models and Their Applications in Drug Discovery and Development: A Primer.
Clin Transl Sci. 2025 Apr;18(4):e70205. doi: 10.1111/cts.70205.
8
A review of large language models and autonomous agents in chemistry.
Chem Sci. 2024 Dec 9;16(6):2514-2572. doi: 10.1039/d4sc03921a. eCollection 2025 Feb 5.
9
Emerging trends in the optimization of organic synthesis through high-throughput tools and machine learning.
Beilstein J Org Chem. 2025 Jan 6;21:10-38. doi: 10.3762/bjoc.21.3. eCollection 2025.
10
Integrating Machine Learning and Large Language Models to Advance Exploration of Electrochemical Reactions.
Angew Chem Int Ed Engl. 2025 Feb 3;64(6):e202418074. doi: 10.1002/anie.202418074. Epub 2024 Dec 18.

References

1
Language models and protocol standardization guidelines for accelerating synthesis planning in heterogeneous catalysis.
Nat Commun. 2023 Dec 2;14(1):7964. doi: 10.1038/s41467-023-43836-5.
2
AI-driven robotic chemist for autonomous synthesis of organic molecules.
Sci Adv. 2023 Nov 3;9(44):eadj0461. doi: 10.1126/sciadv.adj0461. Epub 2023 Nov 1.
3
αExtractor: a system for automatic extraction of chemical information from biomedical literature.
Sci China Life Sci. 2024 Mar;67(3):618-621. doi: 10.1007/s11427-023-2388-x. Epub 2023 Sep 26.
4
ReactionDataExtractor 2.0: A Deep Learning Approach for Data Extraction from Chemical Reaction Schemes.
J Chem Inf Model. 2023 Oct 9;63(19):6053-6067. doi: 10.1021/acs.jcim.3c00422. Epub 2023 Sep 20.
5
An extensive benchmark study on biomedical text generation and mining with ChatGPT.
Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad557.
6
ChatGPT Chemistry Assistant for Text Mining and the Prediction of MOF Synthesis.
J Am Chem Soc. 2023 Aug 16;145(32):18048-18062. doi: 10.1021/jacs.3c05819. Epub 2023 Aug 7.
7
Data Sharing in Chemistry: Lessons Learned and a Case for Mandating Structured Reaction Data.
J Chem Inf Model. 2023 Jul 24;63(14):4253-4265. doi: 10.1021/acs.jcim.3c00607. Epub 2023 Jul 5.
8
RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing.
J Chem Inf Model. 2023 Jul 10;63(13):4030-4041. doi: 10.1021/acs.jcim.3c00439. Epub 2023 Jun 27.
9
MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation.
J Chem Inf Model. 2023 Apr 10;63(7):1925-1934. doi: 10.1021/acs.jcim.2c01480. Epub 2023 Mar 27.
10
Do Large Language Models Understand Chemistry? A Conversation with ChatGPT.
J Chem Inf Model. 2023 Mar 27;63(6):1649-1655. doi: 10.1021/acs.jcim.3c00285. Epub 2023 Mar 16.