University of Michigan Medical School, Ann Arbor, Michigan.
Center for Surgical Training and Research, Department of Surgery, University of Michigan, Ann Arbor, Michigan.
J Surg Educ. 2019 Nov-Dec;76(6):e189-e192. doi: 10.1016/j.jsurg.2019.07.008. Epub 2019 Sep 6.
The profession of surgery is entering a new era of "big data," in which analyses of longitudinal trainee assessment data will inform ongoing efforts to improve surgical education. Given the high-stakes implications of these analyses, researchers must define the conditions under which estimates derived from these large datasets remain valid. In this study, we determined the number of assessments of residents' performances needed to reliably estimate the difficulty of "Core" surgical procedures.
Using the SIMPL smartphone application from the Procedural Learning and Safety Collaborative, 402 attending surgeons directly observed and provided workplace-based assessments for 488 categorical residents after 5259 performances of 87 Core surgical procedures performed at 14 institutions. We used these faculty ratings to construct a linear mixed model with resident performance as the outcome variable and multiple predictors including, most significantly, the operative procedure as a random effect. We interpreted the variance in performance ratings attributable to the procedure, after controlling for other variables, as the "difficulty" of performing the procedure. We conducted a generalizability analysis and decision study to estimate the number of SIMPL performance ratings needed to reliably estimate the difficulty of a typical Core procedure.
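The decision study described above asks how many ratings per procedure are needed before the dependability (phi) coefficient of the mean rating reaches 0.80. The logic can be sketched as below; the variance components used here are illustrative assumptions, not values reported by the study.

```python
# Decision-study sketch: find the number of ratings per procedure
# needed for the dependability (phi) coefficient of the mean rating
# to reach a target value. Variance components are hypothetical.

def dependability(var_procedure: float, var_error: float, n_ratings: int) -> float:
    """Phi coefficient for the mean of n_ratings ratings of one procedure:
    procedure variance over (procedure variance + error variance / n)."""
    return var_procedure / (var_procedure + var_error / n_ratings)

def min_ratings(var_procedure: float, var_error: float, target: float = 0.80) -> int:
    """Smallest number of ratings at which dependability reaches the target."""
    n = 1
    while dependability(var_procedure, var_error, n) < target:
        n += 1
    return n

# With an assumed procedure variance of 0.05 and residual variance of 0.29,
# dependability first reaches 0.80 at 24 ratings per procedure.
print(min_ratings(0.05, 0.29))
```

Because error variance shrinks with the number of ratings averaged, dependability rises monotonically with n; the decision study simply locates the crossing point for the chosen target.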
Twenty-four faculty ratings of resident operative performance were necessary to reliably estimate the difficulty of a typical Core surgical procedure (mean dependability coefficient 0.80, 95% confidence interval 0.73-0.87).
At least 24 operative performance ratings are required to reliably estimate the difficulty of a typical Core surgical procedure. Future research using performance ratings to establish procedure difficulty should include adequate numbers of ratings given the high-stakes implications of those results for curriculum design and policy.