OpenChemIE: An Information Extraction Toolkit for Chemistry Literature.

Affiliations

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.

Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.

Publication information

J Chem Inf Model. 2024 Jul 22;64(14):5521-5534. doi: 10.1021/acs.jcim.4c00572. Epub 2024 Jul 1.

DOI: 10.1021/acs.jcim.4c00572
PMID: 38950894
Abstract

Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.
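The F1 score reported above summarizes precision and recall of the extracted reactions against annotated ones. As a minimal sketch of how such a score can be computed, the example below assumes exact matching between predicted and gold reaction records; the function name `f1_score`, the record layout (reactant SMILES, product SMILES), and the sample data are all hypothetical illustrations, and the paper's actual evaluation protocol may differ.

```python
from typing import Hashable, Iterable


def f1_score(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """Exact-match F1 between predicted and gold records (illustrative only)."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    # A prediction counts as correct only if it matches a gold record exactly.
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical records: (reactant SMILES, product SMILES) pairs.
gold = {("CCO", "CC=O"), ("CCBr", "CCO"), ("c1ccccc1Br", "c1ccccc1O")}
predicted = {("CCO", "CC=O"), ("CCBr", "CCO"), ("CCCl", "CCO")}

score = f1_score(predicted, gold)
print(f"F1 = {score:.3f}")
```

Here two of three predictions match two of three gold records, so precision and recall are both 2/3 and the F1 is also 2/3. Exact matching is the strictest choice; a softer criterion would need a chemistry-aware comparison such as SMILES canonicalization, which this sketch does not attempt.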

Similar articles

1
OpenChemIE: An Information Extraction Toolkit for Chemistry Literature.
J Chem Inf Model. 2024 Jul 22;64(14):5521-5534. doi: 10.1021/acs.jcim.4c00572. Epub 2024 Jul 1.
2
ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature.
J Chem Inf Model. 2016 Oct 24;56(10):1894-1904. doi: 10.1021/acs.jcim.6b00207. Epub 2016 Oct 6.
3
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
4
Versatile Deep Learning Pipeline for Transferable Chemical Data Extraction.
J Chem Inf Model. 2024 Aug 12;64(15):5888-5899. doi: 10.1021/acs.jcim.4c00816. Epub 2024 Jul 15.
5
Macromolecular crowding: chemistry and physics meet biology (Ascona, Switzerland, 10-14 June 2012).
Phys Biol. 2013 Aug;10(4):040301. doi: 10.1088/1478-3975/10/4/040301. Epub 2013 Aug 2.
6
Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning.
Database (Oxford). 2016 Apr 17;2016. doi: 10.1093/database/baw049. Print 2016.
7
Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine.
Database (Oxford). 2019 Jan 1;2019:bay147. doi: 10.1093/database/bay147.
8
Automated electrosynthesis reaction mining with multimodal large language models (MLLMs).
Chem Sci. 2024 Oct 9;15(43):17881-91. doi: 10.1039/d4sc04630g.
9
Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature.
J Chem Inf Model. 2019 Sep 23;59(9):3692-3702. doi: 10.1021/acs.jcim.9b00470. Epub 2019 Aug 19.
10
RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing.
J Chem Inf Model. 2023 Jul 10;63(13):4030-4041. doi: 10.1021/acs.jcim.3c00439. Epub 2023 Jun 27.

Cited by

1
Cross-disciplinary perspectives on the potential for artificial intelligence across chemistry.
Chem Soc Rev. 2025 Apr 25. doi: 10.1039/d5cs00146c.
2
Machine learning-guided strategies for reaction conditions design and optimization.
Beilstein J Org Chem. 2024 Oct 4;20:2476-2492. doi: 10.3762/bjoc.20.212. eCollection 2024.
3
Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis.
Beilstein J Org Chem. 2024 Sep 10;20:2280-2304. doi: 10.3762/bjoc.20.196. eCollection 2024.