Friedman C, Hripcsak G
Department of Computer Science, Queens College CUNY, New York, USA.
Methods Inf Med. 1998 Nov;37(4-5):334-44.
Evaluating natural language processing (NLP) systems in the clinical domain is a difficult task that is nonetheless important for the advancement of the field. A number of NLP systems that extract information from free-text clinical reports have been reported, but few of them have been evaluated. Those that were evaluated reported good performance measures, but the results were often weakened by ineffective evaluation methods. In this paper we describe a set of criteria aimed at improving the quality of NLP evaluation studies. We present an overview of NLP evaluations in the clinical domain and also discuss the Message Understanding Conferences (MUC) [1-4]. Although these conferences constitute a series of NLP evaluation studies performed outside the clinical domain, some of their results are relevant within medicine. In addition, we discuss a number of factors that contribute to the complexity inherent in the task of evaluating natural language systems.