ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents.

Authors

He Jiayuan, Nguyen Dat Quoc, Akhondi Saber A, Druckenbrodt Christian, Thorne Camilo, Hoessel Ralph, Afzal Zubair, Zhai Zenan, Fang Biaoyan, Yoshikawa Hiyori, Albahem Ameer, Cavedon Lawrence, Cohn Trevor, Baldwin Timothy, Verspoor Karin

Affiliations

The University of Melbourne, Parkville, VIC, Australia.

RMIT University, Melbourne, VIC, Australia.

Publication information

Front Res Metr Anal. 2021 Mar 25;6:654438. doi: 10.3389/frma.2021.654438. eCollection 2021.

DOI: 10.3389/frma.2021.654438
PMID: 33870071
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8028406/
Abstract

Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.
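As a rough illustration of what the two tasks ask a system to produce, the sketch below models labeled entity spans and trigger-argument events, then scores predicted spans against gold spans with exact-match F1. Everything in it is an assumption made for illustration: the example sentence, the label and relation names, the `find` helper, and the choice of exact-match F1 are not taken from the abstract and should not be read as the lab's annotation schema or official scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A labeled character span in the source text (hypothetical structure)."""
    start: int
    end: int
    label: str


@dataclass(frozen=True)
class Event:
    """Links a trigger span to one argument span (hypothetical structure)."""
    trigger: Entity
    relation: str
    argument: Entity


def find(text: str, phrase: str, label: str) -> Entity:
    """Build an Entity from the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), label)


def exact_match_f1(gold: list, predicted: list) -> float:
    """F1 where a prediction counts only if span and label both match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented sentence and illustrative labels, not drawn from the corpus.
text = "The mixture was stirred in THF at 25 C for 2 h."
step = find(text, "stirred", "REACTION_STEP")
solvent = find(text, "THF", "SOLVENT")
temperature = find(text, "25 C", "TEMPERATURE")
duration = find(text, "2 h", "TIME")

# Entity-recognition view: a flat list of labeled spans.
gold_entities = [step, solvent, temperature, duration]

# Event-extraction view: the step trigger related to each participant.
gold_events = [
    Event(step, "uses", solvent),
    Event(step, "has_condition", temperature),
    Event(step, "has_condition", duration),
]

# A hypothetical system output: three correct spans and one spurious span.
predicted_entities = [step, solvent, temperature, find(text, "mixture", "SOLVENT")]

score = exact_match_f1(gold_entities, predicted_entities)
print(f"entity F1 = {score:.2f}")
```

The same set-based comparison could be applied to the `Event` objects, since they are hashable; whether matching should be exact or tolerant of partial span overlap is a design choice this sketch leaves open.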
