A hybrid named entity tagger for tagging human proteins/genes.

Author information

Raja Kalpana, Subramani Suresh, Natarajan Jeyakumar

Publication information

Int J Data Min Bioinform. 2014;10(3):315-28. doi: 10.1504/ijdmb.2014.064545.

Abstract

A prerequisite for analyzing scientific literature is the extraction of gene/protein names from biomedical texts. Although many taggers are available for this Named Entity Recognition (NER) task, we found that none of them achieves state-of-the-art tagging for human genes/proteins. As most current text mining research concerns human literature, a tagger that precisely tags human genes and proteins is highly desirable. In this paper, we propose a new hybrid approach based on (a) a machine learning algorithm (conditional random fields), (b) a set of manually constructed rules, and (c) a novel abbreviation identification algorithm, designed to overcome the common errors that available taggers make when tagging human genes/proteins. Experimental results on the JNLPBA2004 corpus show that our domain-specific approach achieves a high precision of 80.47 and an F-score of 75.77, and outperforms most state-of-the-art systems. However, the recall of 71.60 remains low and leaves much room for future improvement.
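
As a quick consistency check on the reported figures, the sketch below recomputes the F-score from the stated precision and recall. It assumes the abstract's F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall.

    Assumption: the abstract reports the balanced (beta = 1) F-score.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract (percentages).
reported_precision = 80.47
reported_recall = 71.60

f1 = f_score(reported_precision, reported_recall)
print(round(f1, 2))  # 75.78
```

Under that assumption the recomputed value is about 75.78, which matches the reported 75.77 to within the rounding of the published precision and recall.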

