Lalor John P, Wu Hao, Yu Hong
University of Massachusetts, MA, USA.
Boston College, MA, USA.
Proc Conf Empir Methods Nat Lang Process. 2016 Nov;2016:648-657. doi: 10.18653/v1/d16-1062.
Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means of gold-standard test-set generation and NLP system evaluation. IRT is able to describe the characteristics of individual items, namely their difficulty and discriminating power, and can account for these characteristics when estimating human intelligence or ability on an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems against the performance of a human population and provides more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, since the IRT score depends on the item characteristics and the response pattern.
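For context, a standard psychometric formulation of the item characteristics named above is the two-parameter logistic (2PL) model; this sketch is illustrative, and the paper's exact model specification may differ. Under the 2PL model, the probability that a subject j with latent ability \theta_j answers item i correctly is

    P(y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}}

where b_i is the difficulty of item i and a_i its discriminating power: a harder item (larger b_i) requires higher ability for the same probability of a correct response, and a more discriminating item (larger a_i) separates subjects near its difficulty level more sharply.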