Department of Science Education, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA.
Department of Biochemistry and Biophysics, University of California San Francisco School of Medicine, San Francisco, California, USA.
Teach Learn Med. 2023 Oct-Dec;35(5):609-622. doi: 10.1080/10401334.2022.2111571. Epub 2022 Aug 20.
Some medical schools have incorporated constructed response short answer questions (CR-SAQs) into their assessment toolkits. Although CR-SAQs carry benefits for medical students and educators, the perception among faculty that creating and scoring CR-SAQs takes an infeasible amount of time, together with concerns about scoring reliability, may impede the use of this assessment type in medical education.
Three US medical schools collaborated to write and score CR-SAQs based on a single vignette. Study participants included faculty question writers (N = 5) and three groups of scorers: faculty content experts (N = 7), faculty non-content experts (N = 6), and fourth-year medical students (N = 7). Structured interviews were conducted with the question writers, and an online survey was administered to the scorers, to gather information about their processes for creating and scoring CR-SAQs. A content analysis was performed on the qualitative data using Bowen's model of feasibility as a framework. To examine inter-rater reliability between the content experts and the other scorer groups, a random selection of fifty student responses from each site was scored by that site's faculty content experts, faculty non-content experts, and student scorers. Two schools used a holistic rubric (6-point Likert scale) and one school used an analytic rubric (3-4-point checklist). Cohen's weighted kappa (κ) was used to evaluate inter-rater reliability.
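For illustration, the following minimal sketch shows how a weighted kappa between two scorers of the same set of responses might be computed in Python, assuming scikit-learn is available; the rubric scores shown are hypothetical examples, not data from the study.

    # Minimal sketch: inter-rater reliability via Cohen's weighted kappa.
    # Assumes scikit-learn; the scores below are hypothetical, not study data.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical holistic-rubric scores (6-point scale) assigned by a
    # faculty content expert and a student scorer to the same ten responses.
    expert_scores = [6, 5, 4, 4, 3, 5, 2, 6, 4, 3]
    student_scores = [6, 4, 4, 5, 3, 5, 3, 6, 4, 2]

    # Linear weighting penalizes disagreements in proportion to their
    # distance on the ordinal scale (quadratic weighting is also common).
    kappa = cohen_kappa_score(expert_scores, student_scores, weights="linear")
    print(f"Weighted kappa: {kappa:.2f}")

Weighted (rather than unweighted) kappa is appropriate here because rubric scores are ordinal, so a one-point disagreement should count less heavily against agreement than a four-point disagreement.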
This study was implemented at three nationally dispersed US medical schools, each of which had administered CR-SAQ summative exams as part of its program of assessment for at least five years. The study question was included in an end-of-course summative exam during the first year of medical school.
Five question writers (100%) participated in the interviews, and twelve scorers (60% response rate) completed the survey. Qualitative comments revealed three aspects of feasibility: practicality (time, institutional culture, teamwork), implementation (steps in the question-writing and scoring process), and adaptation (feedback, rubric adjustment, continuous quality improvement). The scorers described their experience in terms of the need for outside resources, concern about their lack of expertise, and the value gained through scoring. Inter-rater reliability between faculty content experts and student scorers was fair/moderate (κ = .34-.53, holistic rubrics) or substantial (κ = .67-.76, analytic rubric), but much lower between faculty content experts and non-content experts (κ = .18-.29, holistic rubrics; κ = .59-.66, analytic rubric).
Our findings show that, from the faculty perspective, it is feasible to include CR-SAQs in summative exams, and we provide practical information for medical educators creating and scoring CR-SAQs. We also learned that CR-SAQs can be scored reliably by faculty without content expertise or by senior medical students when an analytic rubric is used, or by senior medical students when a holistic rubric is used, which offers options for alleviating the faculty burden associated with grading CR-SAQs.