Chemical entity recognition in patents by combining dictionary-based and statistical approaches.

Authors

Akhondi Saber A, Pons Ewoud, Afzal Zubair, van Haagen Herman, Becker Benedikt F H, Hettne Kristina M, van Mulligen Erik M, Kors Jan A

Affiliations

Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam.

Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands.

Publication information

Database (Oxford). 2016 May 2;2016. doi: 10.1093/database/baw061. Print 2016.

DOI: 10.1093/database/baw061

PMID: 27141091

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4852402/

Abstract

We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small.

Database URL: http://biosemantics.org/chemdner-patents
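
The abstract does not say how the dictionary-based and statistical outputs were combined, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes each system returns mentions as character-offset spans, takes their union, and keeps the statistical span when two spans overlap. It also assumes a passage is labelled positive for CPD when at least one mention is found, and scores CEMP by exact span match. The names `merge_spans`, `passage_has_chemical` and `f_score` are hypothetical.

```python
from typing import List, Tuple

# A mention is a (start, end) character-offset pair, end exclusive (assumed format).
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_spans(dictionary_spans: List[Span], statistical_spans: List[Span]) -> List[Span]:
    """Union of both outputs; on overlap the statistical span wins (an assumption)."""
    merged = list(statistical_spans)
    for d in dictionary_spans:
        if not any(overlaps(d, s) for s in statistical_spans):
            merged.append(d)
    return sorted(merged)


def passage_has_chemical(spans: List[Span]) -> bool:
    """Assumed CPD rule: a passage is positive if any mention was recognized."""
    return len(spans) > 0


def f_score(predicted: List[Span], gold: List[Span]) -> float:
    """Balanced F-score over exact span matches."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy offsets, not taken from the CHEMDNER-patents corpus.
    dictionary_out = [(0, 7), (20, 27)]
    statistical_out = [(0, 9), (40, 47)]
    gold = [(0, 9), (20, 27), (50, 55)]

    merged = merge_spans(dictionary_out, statistical_out)
    print(merged)
    print(passage_has_chemical(merged))
    print(round(f_score(merged, gold), 4))
```

In this toy case the dictionary adds one mention the statistical system missed, which raises recall. The abstract reports that on the real data the gain of the ensemble over the statistical system alone was small.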

Similar articles

Chemical entity recognition in patents by combining dictionary-based and statistical approaches.

Database (Oxford). 2016 May 2;2016. doi: 10.1093/database/baw061. Print 2016.

Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning.

Database (Oxford). 2016 Apr 17;2016. doi: 10.1093/database/baw049. Print 2016.

Mining chemical patents with an ensemble of open systems.

Database (Oxford). 2016 May 12;2016. doi: 10.1093/database/baw065. Print 2016.

NERChem: adapting NERBio to chemical patents via full-token features and named entity feature with chemical sub-class composition.

Database (Oxford). 2016 Oct 25;2016:baw135. doi: 10.1093/database/baw135.

Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S10. doi: 10.1186/1758-2946-7-S1-S10. eCollection 2015.

A neural network approach to chemical and gene/protein entity recognition in patents.

J Cheminform. 2018 Dec 18;10(1):65. doi: 10.1186/s13321-018-0318-3.

tmChem: a high performance approach for chemical named entity recognition and normalization.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S3. doi: 10.1186/1758-2946-7-S1-S3. eCollection 2015.

Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task.

Database (Oxford). 2016 Mar 19;2016. doi: 10.1093/database/baw032. Print 2016.

CHEMDNER: The drugs and chemical names extraction challenge.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S1. doi: 10.1186/1758-2946-7-S1-S1. eCollection 2015.

Improving Biochemical Named Entity Recognition Using PSO Classifier Selection and Bayesian Combination Methods.

IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1327-1338. doi: 10.1109/TCBB.2016.2570216. Epub 2016 May 18.

Cited by

Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora.

Front Res Metr Anal. 2021 Nov 19;6:689803. doi: 10.3389/frma.2021.689803. eCollection 2021.

Learning adaptive representations for entity recognition in the biomedical domain.

J Biomed Semantics. 2021 May 17;12(1):10. doi: 10.1186/s13326-021-00238-0.

ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents.

Front Res Metr Anal. 2021 Mar 25;6:654438. doi: 10.3389/frma.2021.654438. eCollection 2021.

Automatic identification of relevant chemical compounds from patents.

Database (Oxford). 2019 Jan 1;2019:baz001. doi: 10.1093/database/baz001.

OGER++: hybrid multi-type entity recognition.

J Cheminform. 2019 Jan 21;11(1):7. doi: 10.1186/s13321-018-0326-3.

Entity recognition in the biomedical domain using a hybrid approach.

J Biomed Semantics. 2017 Nov 9;8(1):51. doi: 10.1186/s13326-017-0157-6.

References

Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications.

Mol Inform. 2011 Jun;30(6-7):506-19. doi: 10.1002/minf.201100005. Epub 2011 Jul 12.

SureChEMBL: a large-scale, chemically annotated patent document database.

Nucleic Acids Res. 2016 Jan 4;44(D1):D1220-8. doi: 10.1093/nar/gkv1253. Epub 2015 Nov 17.

Ambiguity of non-systematic chemical identifiers within and between small-molecule databases.

J Cheminform. 2015 Nov 16;7:54. doi: 10.1186/s13321-015-0102-6. eCollection 2015.

PubChem Substance and Compound databases.

Nucleic Acids Res. 2016 Jan 4;44(D1):D1202-13. doi: 10.1093/nar/gkv951. Epub 2015 Sep 22.

LeadMine: a grammar and dictionary driven approach to entity recognition.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S5. doi: 10.1186/1758-2946-7-S1-S5. eCollection 2015.

tmChem: a high performance approach for chemical named entity recognition and normalization.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S3. doi: 10.1186/1758-2946-7-S1-S3. eCollection 2015.

The CHEMDNER corpus of chemicals and drugs and its annotation principles.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S2. doi: 10.1186/1758-2946-7-S1-S2. eCollection 2015.

Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S10. doi: 10.1186/1758-2946-7-S1-S10. eCollection 2015.

CHEMDNER: The drugs and chemical names extraction challenge.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S1. doi: 10.1186/1758-2946-7-S1-S1. eCollection 2015.

Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features.

J Am Med Inform Assoc. 2015 May;22(3):671-81. doi: 10.1093/jamia/ocu041. Epub 2015 Mar 9.
