Recognition of chemical entities: combining dictionary-based and grammar-based approaches.

Author information

Department of Medical Informatics, Erasmus University Medical Center, P.O. Box 2040, 3000 CA Rotterdam, The Netherlands.

Department of Human Genetics, Leiden University Medical Center, P.O. Box 9600, 2300 RC Leiden, The Netherlands.

Publication information

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S10. doi: 10.1186/1758-2946-7-S1-S10. eCollection 2015.

DOI: 10.1186/1758-2946-7-S1-S10

PMID: 25810767

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4331686/

Abstract

BACKGROUND

The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effect of different stop-word lists, case-sensitivity matching, and use of chunking information was also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals.
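The abstract does not give the exact formula for the term confidence score, so the following is only a minimal sketch of one plausible reading: the share of a term's relative frequency that comes from chemical journals. The function name `term_confidence`, the toy counts, and the default of 0.5 for unseen terms are all assumptions made for illustration.

```python
from collections import Counter


def term_confidence(term, chem_counts, nonchem_counts, chem_total, nonchem_total):
    """Hypothetical confidence in [0, 1]: fraction of the term's relative
    frequency that is contributed by chemical journals."""
    chem_rate = chem_counts.get(term, 0) / chem_total
    nonchem_rate = nonchem_counts.get(term, 0) / nonchem_total
    if chem_rate + nonchem_rate == 0:
        return 0.5  # assumed default: an unseen term gives no evidence either way
    return chem_rate / (chem_rate + nonchem_rate)


# Toy term counts (invented for illustration, not taken from the paper).
chem_counts = Counter({"aspirin": 40, "lead": 10})
nonchem_counts = Counter({"aspirin": 10, "lead": 90})
chem_total = 1000      # total term occurrences in the chemical-journal sample
nonchem_total = 1000   # total term occurrences in the non-chemical sample

scores = {
    term: term_confidence(term, chem_counts, nonchem_counts, chem_total, nonchem_total)
    for term in ("aspirin", "lead")
}

# Rank the terms found in a document by descending confidence (CDI-style ranking).
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['aspirin', 'lead']
```

Under this reading, an ambiguous word such as "lead" that appears mostly in non-chemical journals is ranked below an unambiguous compound name.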

RESULTS

The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions.
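The reported F-scores are consistent with the balanced F-score, the harmonic mean of precision and recall. The short check below recomputes them from the precision and recall values given above; the function name `f_score` is chosen here for illustration.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the test set.
cem = f_score(precision=0.858, recall=0.712)
cdi = f_score(precision=0.846, recall=0.717)

print(f"CEM F-score: {cem:.1%}")  # CEM F-score: 77.8%
print(f"CDI F-score: {cdi:.1%}")  # CDI F-score: 77.6%
```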

CONCLUSIONS

We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types are likely to further improve system performance.
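As a rough illustration of how such an ensemble might merge its components, the sketch below takes the union of spans from a toy dictionary matcher with a stop-word filter and from a regular expression for database identifiers. It is not the authors' implementation: the toy dictionary stands in for resources such as ChEBI and HMDB, the choice of CAS Registry Numbers as the identifier type is an assumption, and the grammar-based recognizer is omitted entirely.

```python
import re

# Toy lexicon standing in for a real lexical resource (assumption).
DICTIONARY = {"aspirin", "glucose", "lead"}
# Stop words suppress dictionary terms that are usually not chemicals in running text.
STOP_WORDS = {"lead"}

# CAS Registry Number shape: 2-7 digits, 2 digits, 1 check digit (check digit not validated).
CAS_PATTERN = re.compile(r"\b\d{2,7}-\d{2}-\d\b")
TOKEN_PATTERN = re.compile(r"[A-Za-z]+")


def dictionary_spans(text):
    """Case-insensitive single-token dictionary lookup with stop-word filtering."""
    spans = []
    for match in TOKEN_PATTERN.finditer(text):
        word = match.group().lower()
        if word in DICTIONARY and word not in STOP_WORDS:
            spans.append((match.start(), match.end()))
    return spans


def identifier_spans(text):
    """Spans that look like CAS Registry Numbers."""
    return [(m.start(), m.end()) for m in CAS_PATTERN.finditer(text)]


def recognize(text):
    """Union of the spans from both recognizers, in order of appearance."""
    spans = sorted(set(dictionary_spans(text)) | set(identifier_spans(text)))
    return [text[start:end] for start, end in spans]


text = "Aspirin (CAS 50-78-2) did not lead to changes in glucose."
mentions = recognize(text)
print(mentions)  # ['Aspirin', '50-78-2', 'glucose']
```

A real system would also need multi-word term matching and a policy for overlapping spans from different recognizers, which this sketch leaves out.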

Similar articles

LeadMine: a grammar and dictionary driven approach to entity recognition.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S5. doi: 10.1186/1758-2946-7-S1-S5. eCollection 2015.

Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S14. doi: 10.1186/1758-2946-7-S1-S14. eCollection 2015.

Chemical entity recognition in patents by combining dictionary-based and statistical approaches.

Database (Oxford). 2016 May 2;2016. doi: 10.1093/database/baw061. Print 2016.

A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S8. doi: 10.1186/1758-2946-7-S1-S8. eCollection 2015.

CHEMDNER: The drugs and chemical names extraction challenge.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S1. doi: 10.1186/1758-2946-7-S1-S1. eCollection 2015.

Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S9. doi: 10.1186/1758-2946-7-S1-S9. eCollection 2015.

A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S11. doi: 10.1186/1758-2946-7-S1-S11. eCollection 2015.

A document processing pipeline for annotating chemical entities in scientific documents.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S7. doi: 10.1186/1758-2946-7-S1-S7. eCollection 2015.

The CHEMDNER corpus of chemicals and drugs and its annotation principles.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S2. doi: 10.1186/1758-2946-7-S1-S2. eCollection 2015.

Cited by

Ensemble pretrained language models to extract biomedical knowledge from literature.

J Am Med Inform Assoc. 2024 Sep 1;31(9):1904-1911. doi: 10.1093/jamia/ocae061.

A prefix and attention map discrimination fusion guided attention for biomedical named entity recognition.

BMC Bioinformatics. 2023 Feb 8;24(1):42. doi: 10.1186/s12859-023-05172-9.

Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach.

J Cheminform. 2022 Aug 13;14(1):55. doi: 10.1186/s13321-022-00633-4.

Extracting Drug Names and Associated Attributes From Discharge Summaries: Text Mining Study.

JMIR Med Inform. 2021 May 5;9(5):e24678. doi: 10.2196/24678.

ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents.

Front Res Metr Anal. 2021 Mar 25;6:654438. doi: 10.3389/frma.2021.654438. eCollection 2021.

Improving biomedical named entity recognition with syntactic information.

BMC Bioinformatics. 2020 Nov 25;21(1):539. doi: 10.1186/s12859-020-03834-6.

Knowledge-enhanced biomedical named entity recognition and normalization: application to proteins and genes.

BMC Bioinformatics. 2020 Jan 30;21(1):35. doi: 10.1186/s12859-020-3375-3.

Automatic identification of relevant chemical compounds from patents.

Database (Oxford). 2019 Jan 1;2019:baz001. doi: 10.1093/database/baz001.

Putting hands to rest: efficient deep CNN-RNN architecture for chemical named entity recognition with no hand-crafted rules.

J Cheminform. 2018 May 23;10(1):28. doi: 10.1186/s13321-018-0280-0.

Chemical entity recognition in patents by combining dictionary-based and statistical approaches.

Database (Oxford). 2016 May 2;2016. doi: 10.1093/database/baw061. Print 2016.
