
Similar Articles

1. Building an Evaluation Scale using Item Response Theory.
Proc Conf Empir Methods Nat Lang Process. 2016 Nov;2016:648-657. doi: 10.18653/v1/d16-1062.
2. [The estimation of premorbid intelligence levels in French speakers].
Encephale. 2005 Jan-Feb;31(1 Pt 1):31-43. doi: 10.1016/s0013-7006(05)82370-x.
3. The Impact of Test and Sample Characteristics on Model Selection and Classification Accuracy in the Multilevel Mixture IRT Model.
Front Psychol. 2020 Feb 14;11:197. doi: 10.3389/fpsyg.2020.00197. eCollection 2020.
4. The relationship between classical item characteristics and item response time on computer-based testing.
Korean J Med Educ. 2019 Mar;31(1):1-9. doi: 10.3946/kjme.2019.113. Epub 2019 Mar 1.
5. Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds.
Proc Conf Empir Methods Nat Lang Process. 2019 Nov;2019:4240-4250. doi: 10.18653/v1/D19-1434.
6. THE DEPRESSION INVENTORY DEVELOPMENT SCALE: Assessment of Psychometric Properties Using Classical and Modern Measurement Theory in a CAN-BIND Trial.
Innov Clin Neurosci. 2020 Jul 1;17(7-9):30-40.
7. Methodological issues regarding power of classical test theory (CTT) and item response theory (IRT)-based approaches for the comparison of patient-reported outcomes in two groups of patients--a simulation study.
BMC Med Res Methodol. 2010 Mar 25;10:24. doi: 10.1186/1471-2288-10-24.
8. Evaluating Equating Transformations in IRT Observed-Score and Kernel Equating Methods.
Appl Psychol Meas. 2023 Mar;47(2):123-140. doi: 10.1177/01466216221124087. Epub 2022 Oct 4.
9. Item response theory analysis of cognitive tests in people with dementia: a systematic review.
BMC Psychiatry. 2014 Feb 19;14:47. doi: 10.1186/1471-244X-14-47.
10. Relationships Among Classical Test Theory and Item Response Theory Frameworks via Factor Analytic Models.
Educ Psychol Meas. 2015 Jun;75(3):389-405. doi: 10.1177/0013164414559071. Epub 2014 Nov 20.

Cited By

1. ALBA: Adaptive Language-Based Assessments for Mental Health.
Proc Conf. 2024 Jun;2024:2466-2478. doi: 10.18653/v1/2024.naacl-long.136.
2. Assessing key soft skills in organizational contexts: development and validation of the multiple soft skills assessment tool.
Front Psychol. 2024 Oct 25;15:1405822. doi: 10.3389/fpsyg.2024.1405822. eCollection 2024.
3. IRTCI: Item Response Theory for Categorical Imputation.
Res Sq. 2024 Jul 2:rs.3.rs-4529519. doi: 10.21203/rs.3.rs-4529519/v1.
4. Knowledge, attitude, and practices to zoonotic disease risks from livestock birth products among smallholder communities in Ethiopia.
One Health. 2021 Jan 30;12:100223. doi: 10.1016/j.onehlt.2021.100223. eCollection 2021 Jun.
5. Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study.
Proc Conf Empir Methods Nat Lang Process. 2018 Oct-Nov;2018:4711-4716. doi: 10.18653/v1/d18-1500.
6. Using Item Response Theory for Explainable Machine Learning in Predicting Mortality in the Intensive Care Unit: Case-Based Approach.
J Med Internet Res. 2020 Sep 25;22(9):e20268. doi: 10.2196/20268.
7. Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds.
Proc Conf Empir Methods Nat Lang Process. 2019 Nov;2019:4240-4250. doi: 10.18653/v1/D19-1434.
8. Improving Electronic Health Record Note Comprehension With NoteAid: Randomized Trial of Electronic Health Record Note Comprehension Interventions With Crowdsourced Workers.
J Med Internet Res. 2019 Jan 16;21(1):e10793. doi: 10.2196/10793.

References

1. Mastering the game of Go with deep neural networks and tree search.
Nature. 2016 Jan 28;529(7587):484-9. doi: 10.1038/nature16961.

Building an Evaluation Scale using Item Response Theory.

Author Information

Lalor John P, Wu Hao, Yu Hong

Affiliations

University of Massachusetts, MA, USA.

Boston College, MA, USA.

Publication Information

Proc Conf Empir Methods Nat Lang Process. 2016 Nov;2016:648-657. doi: 10.18653/v1/d16-1062.

DOI: 10.18653/v1/d16-1062
PMID: 28004039
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5167538/
Abstract

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regards to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items - their difficulty and discriminating power - and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems with the performance in a human population and is able to provide more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern.
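The item characteristics the abstract describes, difficulty and discriminating power, are conventionally captured by the two-parameter logistic (2PL) item response function. The sketch below illustrates that standard function only; parameter names follow psychometric convention and are not taken from the paper's code:

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability that a respondent with
    ability `theta` answers an item correctly, given the item's
    discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At the same ability level, an easier item (lower b) has a higher
# probability of a correct response.
easy = p_correct(0.0, 1.0, -1.0)
hard = p_correct(0.0, 1.0, 1.0)
print(round(easy, 3), round(hard, 3))  # prints "0.731 0.269"
```

Fitting the model means estimating each item's `a` and `b` (and each respondent's `theta`) from a matrix of right/wrong responses, which is what lets IRT weight a system's errors by how hard and how discriminating the missed items were.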
