Enhancing performance of protein and gene name recognizers with filtering and integration strategies.

Author information

Hou Wen-Juan, Chen Hsin-Hsi

Affiliation

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC.

Publication information

J Biomed Inform. 2004 Dec;37(6):448-60. doi: 10.1016/j.jbi.2004.08.006.

DOI:10.1016/j.jbi.2004.08.006

PMID:15542018

Abstract

Named entity (NE) recognition is a fundamental task in biological relationship mining. This paper uses protein/gene collocates extracted from biological corpora as restrictions to improve the precision of protein/gene name recognition. In addition, we integrate the results of multiple NE recognizers to improve recall. Yapex and KeX are taken as examples of protein name recognizers, and ABGene and Idgene as examples of gene name recognizers. When collocates are incorporated, the precision of Yapex increases from 70.90% to 85.84% at a small cost in recall (a decrease of only 2.44%). When the filtering and integration strategies are employed together, the Yapex-based integration with KeX performs well: its F-score is 7.83% higher than that of Yapex alone. The results for gene recognition show the same tendency: the ABGene-based integration with Idgene shows a 10.18% F-score increase over ABGene alone. These methods can be easily extended to other name finders for biological documents.
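
The two strategies in the abstract, filtering candidates by collocates and integrating the output of a second recognizer, can be illustrated with a minimal Python sketch. The abstract does not say how collocates are matched or how overlapping results are merged, so everything here is an assumption made for illustration: the token-window test (`window`), the rule that the secondary recognizer only contributes non-overlapping spans, the balanced F-score, and all function names and toy data. It is not the authors' implementation and does not call Yapex, KeX, ABGene, or Idgene.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # token-index span [start, end) of a candidate name


def filter_by_collocates(tokens: List[str], spans: Set[Span],
                         collocates: Set[str], window: int = 3) -> Set[Span]:
    """Keep a candidate only if a collocate occurs within `window` tokens of it.

    Assumption: the window-based test is a hypothetical stand-in for the
    paper's collocate restriction.
    """
    kept = set()
    for start, end in spans:
        lo = max(0, start - window)
        hi = min(len(tokens), end + window)
        context = tokens[lo:start] + tokens[end:hi]
        if any(tok.lower() in collocates for tok in context):
            kept.add((start, end))
    return kept


def integrate(primary: Set[Span], secondary: Set[Span]) -> Set[Span]:
    """Add secondary spans that do not overlap any primary span.

    Assumption: the primary recognizer wins on overlap.
    """
    merged = set(primary)
    for s, e in secondary:
        if not any(s < pe and ps < e for ps, pe in primary):
            merged.add((s, e))
    return merged


def f_score(predicted: Set[Span], gold: Set[Span]) -> float:
    """Balanced F-score (harmonic mean of precision and recall), exact match."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy data (invented for illustration only).
tokens = "The p53 protein binds MDM2 in the cell".split()
collocates = {"protein", "binds"}
gold = {(1, 2), (4, 5)}            # "p53" and "MDM2"
primary_raw = {(1, 2), (7, 8)}     # one correct hit, one false positive ("cell")
secondary_raw = {(4, 5)}           # the name the primary recognizer missed

primary = filter_by_collocates(tokens, primary_raw, collocates)
secondary = filter_by_collocates(tokens, secondary_raw, collocates)
combined = integrate(primary, secondary)

print(f_score(primary_raw, gold), f_score(combined, gold))
```

In this toy case the filter drops the false positive, which raises precision, and the integration step recovers the missed name, which raises recall. That mirrors the division of labor described in the abstract, though real gains depend on the recognizers and the corpus.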
