
A Multi-institutional Study of the Feasibility and Reliability of the Implementation of Constructed Response Exam Questions.

Affiliations

Department of Science Education, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA.

Department of Biochemistry and Biophysics, University of California San Francisco School of Medicine, San Francisco, California, USA.

Publication Information

Teach Learn Med. 2023 Oct-Dec;35(5):609-622. doi: 10.1080/10401334.2022.2111571. Epub 2022 Aug 20.

DOI: 10.1080/10401334.2022.2111571
PMID: 35989668
Abstract

PROBLEM

Some medical schools have incorporated constructed response short answer questions (CR-SAQs) into their assessment toolkits. Although CR-SAQs carry benefits for medical students and educators, faculty perceptions that creating and scoring CR-SAQs takes an infeasible amount of time, along with concerns about scoring reliability, may impede the use of this assessment type in medical education.

INTERVENTION

Three US medical schools collaborated to write and score CR-SAQs based on a single vignette. Study participants included faculty question writers (N = 5) and three groups of scorers: faculty content experts (N = 7), faculty non-content experts (N = 6), and fourth-year medical students (N = 7). Structured interviews were performed with question writers, and an online survey was administered to scorers to gather information about their process for creating and scoring CR-SAQs. A content analysis was performed on the qualitative data using Bowen's model of feasibility as a framework. To examine inter-rater reliability between the content expert and other scorers, a random selection of fifty student responses from each site was scored by that site's faculty content experts, faculty non-content experts, and student scorers. A holistic rubric (6-point Likert scale) was used by two schools and an analytic rubric (3-4 point checklist) was used by one school. Cohen's weighted kappa (κ) was used to evaluate inter-rater reliability.
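The abstract does not define the statistic, so for reference, a standard formulation of Cohen's weighted kappa for k ordinal rating categories is sketched below; the notation is ours, not the authors'.

```latex
% O_{ij}: observed proportion of responses scored i by one rater and j by the other
% E_{ij}: chance-expected proportion (product of the two raters' marginals)
% w_{ij}: disagreement weight, e.g. linear weights w_{ij} = |i - j| / (k - 1)
\kappa_w \;=\; 1 \;-\; \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, O_{ij}}
                            {\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, E_{ij}}
```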

CONTEXT

This research study was implemented at three US medical schools that are nationally dispersed and have been administering CR-SAQ summative exams as part of their programs of assessment for at least five years. The study exam question was included in an end-of-course summative exam during the first year of medical school.

IMPACT

Five question writers (100%) participated in the interviews and twelve scorers (60% response rate) completed the survey. Qualitative comments revealed three aspects of feasibility: practicality (time, institutional culture, teamwork), implementation (steps in the question writing and scoring process), and adaptation (feedback, rubric adjustment, continuous quality improvement). The scorers described their experience in terms of the need for outside resources, concern about lack of expertise, and value gained through scoring. Inter-rater reliability between the faculty content expert and student scorers was fair/moderate (κ = .34-.53, holistic rubrics) or substantial (κ = .67-.76, analytic rubric), but much lower between faculty content and non-content experts (κ = .18-.29, holistic rubrics; κ = .59-.66, analytic rubric).
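To make the reported agreement figures concrete, here is a minimal sketch of computing Cohen's weighted kappa with scikit-learn; the scores below are invented for illustration and are not the study's data.

```python
# Minimal sketch: inter-rater agreement via Cohen's weighted kappa.
# The scores are hypothetical examples, not data from the study.
from sklearn.metrics import cohen_kappa_score

# Hypothetical holistic-rubric scores (6-point Likert scale) given by a
# faculty content expert and a student scorer to the same ten responses.
expert_scores  = [5, 4, 6, 3, 5, 2, 4, 6, 3, 5]
student_scores = [4, 4, 5, 3, 6, 2, 3, 6, 2, 5]

# Linear weights penalize disagreements in proportion to their distance
# on the ordinal scale, so a 5-vs-4 split costs less than a 5-vs-1 split.
kappa = cohen_kappa_score(expert_scores, student_scores, weights="linear")
print(f"weighted kappa = {kappa:.2f}")
```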

LESSONS LEARNED

Our findings show that, from the faculty perspective, it is feasible to include CR-SAQs in summative exams, and we provide practical information for medical educators creating and scoring CR-SAQs. We also learned that CR-SAQs can be reliably scored with an analytic rubric by faculty without content expertise or by senior medical students, or with a holistic rubric by senior medical students, which provides options to alleviate the faculty burden associated with grading CR-SAQs.

Similar Articles

1. A Multi-institutional Study of the Feasibility and Reliability of the Implementation of Constructed Response Exam Questions.
Teach Learn Med. 2023 Oct-Dec;35(5):609-622. doi: 10.1080/10401334.2022.2111571. Epub 2022 Aug 20.

2. What have we learned about constructed response short-answer questions from students and faculty? A multi-institutional study.
Med Teach. 2024 Mar;46(3):349-358. doi: 10.1080/0142159X.2023.2249209. Epub 2023 Sep 9.

3. Should multiple-choice questions get the SAQ? Development of a short-answer question writing rubric.
Curr Pharm Teach Learn. 2022 May;14(5):591-596. doi: 10.1016/j.cptl.2022.04.004. Epub 2022 May 7.

4. Development and Validation of a Tool to Evaluate the Evolution of Clinical Reasoning in Trauma Using Virtual Patients.
J Surg Educ. 2018 May-Jun;75(3):779-786. doi: 10.1016/j.jsurg.2017.08.024. Epub 2017 Sep 18.

5. Patients don't come with multiple choice options: essay-based assessment in UME.
Med Educ Online. 2019 Dec;24(1):1649959. doi: 10.1080/10872981.2019.1649959.

6. Holistic rubric vs. analytic rubric for measuring clinical performance levels in medical students.
BMC Med Educ. 2018 Jun 5;18(1):124. doi: 10.1186/s12909-018-1228-9.

7. Development and Validation of a Rubric to Evaluate Diabetes SOAP Note Writing in APPE.
Am J Pharm Educ. 2018 Nov;82(9):6725. doi: 10.5688/ajpe6725.

8. Use of an Analytical Grading Rubric for Self-Assessment: A Pilot Study for a Periodontal Oral Competency Examination in Predoctoral Dental Education.
J Dent Educ. 2015 Dec;79(12):1429-36.

9. Assessment to Optimize Learning Strategies: A Qualitative Study of Student and Faculty Perceptions.
Teach Learn Med. 2021 Jun-Jul;33(3):245-257. doi: 10.1080/10401334.2020.1852940. Epub 2021 Jan 13.

10. Developing, evaluating and validating a scoring rubric for written case reports.
Int J Med Educ. 2014 Feb 1;5:18-23. doi: 10.5116/ijme.52c6.d7ef.

Cited By

1. Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders.
Med Educ Online. 2025 Dec;30(1):2550751. doi: 10.1080/10872981.2025.2550751. Epub 2025 Aug 24.

2. Performance of ChatGPT on optometry and vision science exam questions.
Ophthalmic Physiol Opt. 2025 Sep;45(6):1376-1388. doi: 10.1111/opo.13544. Epub 2025 Jul 9.

3. Integration of physiology in a curriculum on human structure: a snapshot of the cardiovascular block.
Front Physiol. 2023 Jul 14;14:1236409. doi: 10.3389/fphys.2023.1236409. eCollection 2023.