Lalor John P, Wu Hao, Yu Hong
University of Massachusetts, MA, USA.
Boston College, MA, USA.
Proc Conf Empir Methods Nat Lang Process. 2016 Nov;2016:648-657. doi: 10.18653/v1/d16-1062.
Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means of gold-standard test-set generation and NLP system evaluation. IRT is able to describe the characteristics of individual items, namely their difficulty and discriminating power, and can account for these characteristics when estimating human intelligence or ability on an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems against the performance of a human population and provides more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, since the IRT score depends on the item characteristics and the response pattern.
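For context, a standard psychometric formulation of the item characteristics named above is the two-parameter logistic (2PL) model; this sketch is illustrative, and the paper's exact model specification may differ. Under the 2PL model, the probability that a subject j with latent ability \theta_j answers item i correctly is

    P(y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}}

where b_i is the difficulty of item i and a_i its discriminating power: a harder item (larger b_i) requires higher ability for the same probability of a correct response, and a more discriminating item (larger a_i) separates subjects near its difficulty level more sharply.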