Hirschman Lynette, Morgan Alexander A, Yeh Alexander S
The MITRE Corporation, MS K312, 202 Burlington Rd., Bedford, MA 01730, USA.
J Biomed Inform. 2002 Aug;35(4):247-59. doi: 10.1016/s1532-0464(03)00014-5.
As the pace of biological research accelerates, biologists are becoming increasingly reliant on computers to manage the information explosion. Biologists communicate their research findings by relying on precise biological terms; these terms then provide indices into the literature and across the growing number of biological databases. This article examines emerging techniques to access biological resources through extraction of entity names and relations among them. Information extraction has been an active area of research in natural language processing and there are promising results for information extraction applied to news stories, e.g., balanced precision and recall in the 93-95% range for identifying person, organization and location names. But these results do not seem to transfer directly to biological names, where results remain in the 75-80% range. Multiple factors may be involved, including absence of shared training and test sets for rigorous measures of progress, lack of annotated training data specific to biological tasks, pervasive ambiguity of terms, frequent introduction of new terms, and a mismatch between evaluation tasks as defined for news and real biological problems. We present evidence from a simple lexical matching exercise that illustrates some specific problems encountered when identifying biological names. We conclude by outlining a research agenda to raise performance of named entity tagging to a level where it can be used to perform tasks of biological importance.
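The abstract refers to a simple lexical matching exercise and to balanced precision and recall, without giving details of either. The following is a minimal sketch of what such an exercise could look like, not the authors' procedure. The lexicon, the sentence, the gold annotation, and the function names `tokenize`, `lexical_match`, and `precision_recall_f` are all invented for illustration, and "balanced" is assumed here to mean the F-measure with equal weight on precision and recall.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Hyphens inside a token are kept (e.g. "il-2"); other punctuation is dropped.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def lexical_match(text, lexicon):
    """Return the set of (start, end) token spans that exactly match a lexicon entry."""
    tokens = tokenize(text)
    entries = {tuple(tokenize(name)) for name in lexicon}
    max_len = max((len(entry) for entry in entries), default=0)
    spans = set()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end <= len(tokens) and tuple(tokens[start:end]) in entries:
                spans.add((start, end))
    return spans


def precision_recall_f(predicted, gold):
    """Compute precision, recall, and the balanced F-measure over span sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy lexicon (assumption): full gene names only, no abbreviations.
lexicon = ["white", "wingless", "decapentaplegic"]

# Toy sentence (assumption): "white" occurs once as a gene name and once as
# an ordinary adjective; "dpp" abbreviates a name the lexicon lists in full.
text = "Mutations in white give white eyes; wingless and dpp were unaffected."

# Hypothetical gold annotation as token spans: white (gene), wingless, dpp.
gold = {(2, 3), (6, 7), (8, 9)}

predicted = lexical_match(text, lexicon)
precision, recall, f_measure = precision_recall_f(predicted, gold)
print(sorted(predicted))
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

In this toy case the second "white" is matched although it is an ordinary word, which lowers precision, and "dpp" is missed because only the full name is in the lexicon, which lowers recall. These correspond to two of the factors the abstract lists: ambiguity of terms and names that a fixed lexicon does not cover.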