Improving named entity recognition accuracy for gene and protein in biomedical text literature.

Author information

Tohidi Hossein, Ibrahim Hamidah, Murad Masrah Azrifah Azmi

Publication information

Int J Data Min Bioinform. 2014;10(3):239-68. doi: 10.1504/ijdmb.2014.064523.

Abstract

The task of recognising biomedical named entities in natural language documents, called biomedical Named Entity Recognition (NER), is the focus of many researchers owing to the complex nature of such texts. This complexity includes character-level, word-level and word-order variations. In this study, we propose an approach for recognising gene and protein names that handles these issues. Like previous related work, our approach is based on the assumption that a named entity occurs within a noun group. The strength of the proposed approach lies in a Statistical Character-based Syntax Similarity (SCSS) algorithm, which measures the similarity between the extracted candidates and well-known biomedical named entities from the GENIA V3.0 corpus. The proposed approach was evaluated and the results are satisfactory. For recognition of both gene and protein names, we achieved 97.2% precision (P), 95.2% recall (R) and an F-measure of 96.1. For recognition of protein names, we obtained 98.1% for P, 97.5% for R and an F-measure of 97.7.
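
As a quick consistency check on the reported figures, the sketch below computes the balanced F-measure, assuming it is the usual harmonic mean of precision and recall (the abstract does not state which F-measure variant was used). The function name `f_measure` and the dictionary labels are illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
reported = {
    "gene and protein names": (97.2, 95.2),
    "protein names": (98.1, 97.5),
}

for task, (p, r) in reported.items():
    print(f"{task}: F = {f_measure(p, r):.2f}")
```

Under that assumption the rounded inputs give roughly 96.2 and 97.8, close to the reported 96.1 and 97.7. The small gap may reflect F being computed from unrounded precision and recall, or truncation rather than rounding.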
