A hybrid named entity tagger for tagging human proteins/genes.

Author information

Raja Kalpana, Subramani Suresh, Natarajan Jeyakumar

Publication information

Int J Data Min Bioinform. 2014;10(3):315-28. doi: 10.1504/ijdmb.2014.064545.

Abstract

A principal step and prerequisite in the analysis of scientific literature is the extraction of gene/protein names from biomedical texts. Although many taggers are available for this Named Entity Recognition (NER) task, we found that none of them achieves state-of-the-art tagging for human genes/proteins. As most current text mining research concerns human literature, a tagger that precisely tags human genes and proteins is highly desirable. In this paper, we propose a new hybrid approach based on (a) a machine learning algorithm (conditional random fields), (b) a set of manually constructed rules, and (c) a novel abbreviation identification algorithm, designed to surmount the common errors that available taggers make when tagging human genes/proteins. Experimental results on the JNLPBA2004 corpus show that our domain-specific approach achieves a high precision of 80.47 and an F-score of 75.77, and outperforms most state-of-the-art systems. However, the recall of 71.60 remains low and leaves much room for future improvement.
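As a quick consistency check on the reported figures, the sketch below recomputes the F-score from the stated precision and recall. It assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` and its `beta` parameter are illustrative and not taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1).

    Assumption: the abstract's "F-score" is the balanced F1.
    Inputs and output are on the same scale (here, percentages).
    """
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values as reported in the abstract (percentages).
reported_precision = 80.47
reported_recall = 71.60

f1 = f_score(reported_precision, reported_recall)
print(f"F1 recomputed from rounded precision and recall: {f1:.2f}")
```

Under that assumption the recomputed value is about 75.78, within 0.01 of the reported 75.77. A gap of that size would be expected if the published score was computed from unrounded precision and recall.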
