Hurley Katrina F, Giffin Nick A, Stewart Samuel A, Bullock Graham B
Department of Emergency Medicine, Dalhousie University, Halifax, NS, Canada;
Bachelor of Medicine Class of 2016, Dalhousie University, Halifax, NS, Canada.
Med Educ Online. 2015 Oct 20;20:29242. doi: 10.3402/meo.v20.29242. eCollection 2015.
The Objective Structured Clinical Examination (OSCE) is a widely employed tool for measuring clinical competence. In the drive toward comprehensive assessment, OSCE stations and checklists may become increasingly complex. The objective of this study was to probe inter-observer reliability and observer accuracy as a function of OSCE checklist length.
Study participants included emergency physicians and senior residents in Emergency Medicine at Dalhousie University. Participants watched an identical series of four, scripted, standardized videos enacting 10-min OSCE stations and completed corresponding assessment checklists. Each participating observer was provided with a random combination of two 40-item and two 20-item checklists. A panel of physicians scored the scenarios through repeated video review to determine the 'gold standard' checklist scores.
Fifty-seven observers completed 228 assessment checklists. Mean observer accuracy ranged from 73 to 93% (14.6-18.7/20), with an overall accuracy of 86% (17.2/20); inter-rater reliability ranged from 58 to 78%. After controlling for station and individual variation, the number of checklist items had no significant effect on overall accuracy (p=0.2305). Rating consistency, measured with the intraclass correlation coefficient, did not differ significantly between the 20- and 40-item checklists (coefficients ranged from 0.432 to 0.781; p-values from 0.56 to 0.73).
The addition of 20 checklist items to a core list of 20 items in an OSCE assessment checklist does not appear to impact observer accuracy or inter-rater reliability.
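The consistency analysis above relies on the intraclass correlation coefficient. As a rough illustration only (this is not the authors' analysis code, the abstract does not state which ICC variant was used, and the ratings matrix below is made up), the two-way random-effects, single-rater, absolute-agreement form ICC(2,1) can be computed from the mean squares of a two-way ANOVA on a subjects-by-raters score matrix:

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_subjects, k_raters) matrix of scores, one row per
    assessed performance and one column per observer.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    # ANOVA mean squares
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual

    # Shrout & Fleiss ICC(2,1)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: 4 performances scored by 3 observers
scores = np.array([[9, 8, 9],
                   [7, 7, 6],
                   [5, 6, 5],
                   [8, 8, 9]], dtype=float)
print(icc2_1(scores))
```

When the raters agree perfectly, the residual and rater mean squares vanish and the coefficient is 1; disagreement inflates the denominator and pulls the value toward 0.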