Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature.

Authors

Vangala Sarveswara Rao, Krishnan Sowmya Ramaswamy, Bung Navneet, Nandagopal Dhandapani, Ramasamy Gomathi, Kumar Satyam, Sankaran Sridharan, Srinivasan Rajgopal, Roy Arijit

Affiliation

TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad, 500081, India.

Publication

J Cheminform. 2024 Nov 26;16(1):131. doi: 10.1186/s13321-024-00928-8.

DOI:10.1186/s13321-024-00928-8
PMID:39593165
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11590295/

Abstract

With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in future.

Scientific contribution

In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.
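The comparison described in the abstract (newly extracted reactions measured against an earlier curated set) can be illustrated with a small sketch. This is not the authors' pipeline: the function names `normalize` and `enhancement` are hypothetical, the reaction SMILES strings are toy examples, and the order-insensitive string normalization is only a stand-in for proper chemical canonicalization. It also assumes the reported percentage is taken relative to the size of the earlier dataset, which the abstract does not state explicitly.

```python
def normalize(rxn: str) -> str:
    """Return an order-insensitive form of a 'reactants>agents>products' string.

    Assumption: sorting the dot-separated species is enough to match duplicates.
    A real comparison would canonicalize each molecule with a cheminformatics toolkit.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"not a reaction SMILES: {rxn!r}")
    return ">".join(".".join(sorted(p.split("."))) for p in parts)


def enhancement(existing, extracted):
    """Count reactions in `extracted` absent from `existing`.

    Returns (number of new reactions, percentage relative to the existing set).
    """
    old = {normalize(r) for r in existing}
    if not old:
        raise ValueError("existing dataset is empty")
    new = {normalize(r) for r in extracted} - old
    return len(new), 100.0 * len(new) / len(old)


# Toy data: four previously curated reactions.
existing = [
    "CC(=O)O.CCO>>CC(=O)OCC",
    "C=C.Br>>CCBr",
    "CCBr.[OH-]>>CCO",
    "CC=O.[H][H]>>CCO",
]
# Two extracted reactions: one reordered duplicate, one genuinely new.
extracted = [
    "CCO.CC(=O)O>>CC(=O)OCC",
    "CCCl.[OH-]>>CCO",
]

n_new, pct = enhancement(existing, extracted)
print(f"{n_new} new reaction(s), {pct:.1f}% of the existing set")
```

Comparing normalized strings as sets keeps reordered duplicates from being counted as new, which matters when the same reaction is written differently in two sources.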

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/338bef4b155f/13321_2024_928_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/63569e17ceb8/13321_2024_928_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/99687f743341/13321_2024_928_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/9ffc807512fe/13321_2024_928_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/cf9d8848fc05/13321_2024_928_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/93a6747848e9/13321_2024_928_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/30709675dec9/13321_2024_928_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/6b0a7917ae57/13321_2024_928_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdb3/11590295/c3c33d17834e/13321_2024_928_Fig9_HTML.jpg

Similar articles

1. Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature.
   J Cheminform. 2024 Nov 26;16(1):131. doi: 10.1186/s13321-024-00928-8.
2. An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study.
   JMIR Med Inform. 2024 Dec 18;12:e60665. doi: 10.2196/60665.
3. Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level.
   Molecules. 2020 May 19;25(10):2357. doi: 10.3390/molecules25102357.
4. AutoTemplate: enhancing chemical reaction datasets for machine learning applications in organic chemistry.
   J Cheminform. 2024 Jun 27;16(1):74. doi: 10.1186/s13321-024-00869-2.
5. Deep Retrosynthetic Reaction Prediction using Local Reactivity and Global Attention.
   JACS Au. 2021 Aug 5;1(10):1612-1620. doi: 10.1021/jacsau.1c00246. eCollection 2021 Oct 25.
6. Reaction Templates: Bridging Synthesis Knowledge and Artificial Intelligence.
   Acc Chem Res. 2024 Jul 16;57(14):1964-1972. doi: 10.1021/acs.accounts.4c00261. Epub 2024 Jun 26.
7. Integrating Machine Learning and Large Language Models to Advance Exploration of Electrochemical Reactions.
   Angew Chem Int Ed Engl. 2025 Feb 3;64(6):e202418074. doi: 10.1002/anie.202418074. Epub 2024 Dec 18.
8. Automated Retrosynthesis Planning of Macromolecules Using Large Language Models and Knowledge Graphs.
   Macromol Rapid Commun. 2025 Feb 27:e2500065. doi: 10.1002/marc.202500065.
9. AI in Home Care-Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study.
   J Med Internet Res. 2025 Apr 28;27:e70703. doi: 10.2196/70703.
10. Deep learning-based automatic action extraction from structured chemical synthesis procedures.
    PeerJ Comput Sci. 2023 Aug 18;9:e1511. doi: 10.7717/peerj-cs.1511. eCollection 2023.

Cited by

1. Advanced machine learning for innovative drug discovery.
   J Cheminform. 2025 Aug 8;17(1):122. doi: 10.1186/s13321-025-01061-w.
2. Implementation of an open chemistry knowledge base with a Semantic Wiki.
   J Cheminform. 2025 Jul 6;17(1):99. doi: 10.1186/s13321-025-01037-w.
3. Leveraging Prompt Engineering in Large Language Models for Accelerating Chemical Research.
   ACS Cent Sci. 2025 Apr 2;11(4):511-519. doi: 10.1021/acscentsci.4c01935. eCollection 2025 Apr 23.