Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations.

Author information

Database/Bioinformatics Laboratory, School of Electrical & Computer Engineering, Chungbuk National University, Cheongju, South Korea.

Publication information

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S9. doi: 10.1186/1758-2946-7-S1-S9. eCollection 2015.

Abstract

BACKGROUND

Chemical and biomedical named entity recognition (NER) is an essential prerequisite for effective text mining of biochemical text. Because the biomedical literature keeps growing, exploiting unlabeled text to improve system performance has been an active and challenging research topic in text mining. We present a semi-supervised learning method that efficiently exploits unlabeled data to incorporate domain knowledge into a named entity recognition model and thereby improve its performance. The method combines natural language processing (NLP) tasks for text preprocessing, word representation features learned from a large amount of text for feature extraction, and conditional random fields for token classification. Apart from free text in the domain, the method relies on no lexicon or dictionary, which keeps the system applicable to other NER tasks on biomedical text.
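As a rough illustration of how word representation features might be fed to a token classifier, the sketch below builds per-token feature dictionaries that mix simple orthographic cues with word-class features. It is not the authors' implementation: the `WORD_CLUSTERS` table, its bit-string cluster IDs, and the function names are all hypothetical, and the abstract does not say which kind of word representation is used. In a real pipeline the table would be learned from unlabeled domain text, and the resulting dictionaries would be passed to a conditional random field trainer.

```python
from typing import Dict, List

# Hypothetical word-to-cluster map (assumption for illustration only).
# In practice it would be learned from a large unlabeled corpus.
WORD_CLUSTERS: Dict[str, str] = {
    "aspirin": "110100",
    "ibuprofen": "110101",
    "inhibits": "0110",
}


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for token i: orthographic cues plus cluster features."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
        "suffix3": word[-3:].lower(),
    }
    cluster = WORD_CLUSTERS.get(word.lower())
    if cluster is not None:
        # Prefixes of the cluster ID give coarser-to-finer word classes.
        for n in (2, 4, 6):
            feats[f"cluster_prefix{n}"] = cluster[:n]
    # Context features: neighbouring tokens, with sentence-boundary markers.
    feats["prev"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token in a sentence, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Words missing from the table simply receive no cluster features, so the classifier falls back on the orthographic and context cues for them.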

RESULTS

We extended BANNER, a biomedical NER system, with the proposed method, yielding an integrated system that can be applied to chemical and drug NER or to biomedical NER. We call our branch of BANNER BANNER-CHEMDNER. It scales to millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems through the BANNER Unstructured Information Management Architecture (UIMA) interface. BANNER-CHEMDNER achieved F-measures of 85.68% and 86.47% on the test sets of the CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and an F-measure of 87.04% on the official test set of the BioCreative II gene mention task, showing strong performance in both chemical and biomedical NER. The BANNER-CHEMDNER system is available at: https://bitbucket.org/tsendeemts/banner-chemdner.
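For readers unfamiliar with the metric, the sketch below shows how an F-measure can be computed from counts of correct, spurious, and missed entity predictions. It assumes the balanced F1 form (`beta=1.0`); the function name and the counts in the usage note are illustrative only and are not taken from the evaluation reported above.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true positives, false positives and false negatives.

    With beta=1.0 this is the harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

With made-up counts of 8 correct predictions, 2 spurious ones, and 2 missed entities, precision and recall are both 0.8, so `f_measure(8, 2, 2)` also evaluates to 0.8.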
