BCC-NER: bidirectional, contextual clues named entity tagger for gene/protein mention recognition.

Author information

Murugesan Gurusamy, Abdulkadhar Sabenabanu, Bhasuran Balu, Natarajan Jeyakumar

Affiliations

Data Mining and Text Mining Lab, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, 641046, India.

Center for Computational Biology, DRDO-BU Center for Life Sciences, Bharathiar University, Coimbatore, Tamilnadu, 641046, India.

Publication information

EURASIP J Bioinform Syst Biol. 2017 Dec;2017(1):7. doi: 10.1186/s13637-017-0060-6. Epub 2017 May 5.

Abstract

Tagging biomedical entities such as genes, proteins, cells, and cell lines is the first step and an important prerequisite in biomedical literature mining. In this paper, we describe our hybrid named entity tagging approach, BCC-NER (bidirectional, contextual clues named entity tagger for gene/protein mention recognition). BCC-NER comprises three modules. The first module handles text processing, which includes basic NLP pre-processing, feature extraction, and feature selection. The second module handles training and model building: it uses bidirectional conditional random fields (CRF) to parse the text in both directions (forward and backward) and integrates the backward and forward trained models using the margin-infused relaxed algorithm (MIRA). The third and final module performs post-processing to improve performance, using surrounding text features, parenthesis mismatch handling, and a two-tier abbreviation algorithm. Evaluated on the BioCreative II GM test corpus, BCC-NER achieves a precision of 89.95, a recall of 84.15, and an overall F-score of 86.95, which is higher than that of the other currently available open source taggers.
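Two pieces of the abstract are easy to illustrate with a short sketch. The first function recomputes the balanced F-score from the reported precision and recall, assuming the F-score is the usual harmonic mean. The second is a hypothetical stand-in for a parenthesis mismatch check: the abstract does not say how BCC-NER treats mismatched mentions, so discarding them here is an assumption made only for illustration. The names `f_score`, `has_balanced_brackets`, and the sample mentions are not taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def has_balanced_brackets(mention: str) -> bool:
    """Return True if every bracket in the mention is opened and closed in order."""
    closing_to_opening = {")": "(", "]": "[", "}": "{"}
    opening = set(closing_to_opening.values())
    stack = []
    for ch in mention:
        if ch in opening:
            stack.append(ch)
        elif ch in closing_to_opening:
            # A closing bracket must match the most recent unclosed opener.
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


# Precision and recall as reported in the abstract.
reported_f = round(f_score(89.95, 84.15), 2)
print(reported_f)

# Made-up candidate mentions; the filtering policy is an assumption, not the paper's method.
candidates = ["p53", "interleukin-2 (IL-2", "NF-kappa B (p65)"]
kept = [m for m in candidates if has_balanced_brackets(m)]
print(kept)
```

A real post-processing step might try to repair a mismatched mention by extending or trimming its span rather than dropping it; this sketch only shows how a mismatch can be detected.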

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9e7c/5419958/aba8223892c1/13637_2017_60_Fig1_HTML.jpg
