Torii Manabu, Hu Zhangzhi, Wu Cathy H, Liu Hongfang
The Imaging Science and Information Systems Center, Department of Oncology, Georgetown University Medical Center, 2115 Wisconsin Avenue NW, Washington, DC 20057, USA.
J Am Med Inform Assoc. 2009 Mar-Apr;16(2):247-55. doi: 10.1197/jamia.M2844. Epub 2008 Dec 11.
Biomedical named entity recognition (BNER) is a critical component of automated systems that mine biomedical knowledge from free text. Of the entity types in this domain, gene/protein names are the most studied in BNER. Our goal is to develop a gene/protein name recognition system, BioTagger-GM, that exploits the rich information in terminology sources by means of powerful machine learning frameworks and system combination.
BioTagger-GM consists of four main components: (1) dictionary lookup: gene/protein names in BioThesaurus and biomedical terms in UMLS Metathesaurus are tagged in text; (2) machine learning: machine learning systems are trained using dictionary lookup results as one type of feature; (3) post-processing: heuristic rules are used to correct recognition errors; and (4) system combination: a voting scheme is used to combine recognition results from multiple systems.
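The dictionary lookup and system combination steps can be illustrated with a minimal Python sketch. The abstract does not specify the matching strategy or the voting rule, so everything below is an assumption made for illustration: `dictionary_lookup` uses greedy longest match over lowercased tokens, `vote_spans` keeps any span proposed by at least `min_votes` systems, and the function names, the `max_len` parameter, and the toy dictionary are hypothetical rather than taken from BioTagger-GM.

```python
from collections import Counter


def dictionary_lookup(tokens, dictionary, max_len=5):
    """Tag tokens B/I/O by greedy longest match (an assumed strategy).

    `dictionary` is a set of tuples of lowercased tokens, e.g. {("mdm2",)}.
    The resulting tags could serve as one feature for a sequence labeler.
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def vote_spans(system_spans, min_votes):
    """Keep (start, end) spans proposed by at least `min_votes` systems.

    A simple threshold vote (assumed); surviving spans may still overlap
    and would need a separate resolution step.
    """
    counts = Counter(span for spans in system_spans for span in set(spans))
    return sorted(span for span, c in counts.items() if c >= min_votes)


# Toy data, invented for this example.
tokens = ["The", "p53", "tumor", "suppressor", "binds", "MDM2"]
toy_dictionary = {("p53",), ("p53", "tumor", "suppressor"), ("mdm2",)}
lookup_tags = dictionary_lookup(tokens, toy_dictionary)

# Spans (token offsets, end exclusive) from three hypothetical systems.
combined = vote_spans(
    [[(1, 4), (5, 6)], [(1, 4)], [(1, 2), (5, 6)]],
    min_votes=2,
)
```

Voting on whole spans rather than on individual token tags avoids producing ill-formed tag sequences (an "I" with no preceding "B"), at the cost of discarding partial agreement between systems.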
The BioCreAtIvE II Gene Mention (GM) corpus was used to evaluate the proposed method. To test its general applicability, the method was also evaluated on the JNLPBA corpus modified for gene/protein name recognition. The performance of the systems was evaluated through cross-validation tests and measured using precision, recall, and F-Measure.
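For reference, the three measures can be computed from sets of predicted and gold-standard mentions as in the sketch below. It assumes exact-match scoring over `(start, end)` spans, which is a simplification; the official scoring of a given corpus may accept alternative boundaries. The function name and the toy spans are hypothetical.

```python
def precision_recall_f(gold, predicted):
    """Return (precision, recall, F-measure) under exact span matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy spans, invented for this example: 2 of 4 predictions are correct,
# and 2 of 3 gold mentions are found.
gold_spans = [(1, 4), (5, 6), (8, 9)]
predicted_spans = [(1, 4), (5, 6), (10, 11), (12, 13)]
p, r, f = precision_recall_f(gold_spans, predicted_spans)
```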
BioTagger-GM achieved an F-Measure of 0.8887 on the BioCreAtIvE II GM corpus, which is higher than that of the first-place system in the BioCreAtIvE II challenge. The applicability of the method was also confirmed on the modified JNLPBA corpus.
The results suggest that terminology sources, powerful machine learning frameworks, and system combination can be integrated to build an effective BNER system.