Department of Ophthalmology and Visual Sciences, Washington University School of Medicine, Saint Louis, Missouri.
Graduate Medical Education, University of Minnesota, Minneapolis, Minnesota.
J Surg Educ. 2021 Jul-Aug;78(4):1077-1088. doi: 10.1016/j.jsurg.2021.02.004. Epub 2021 Feb 25.
To test whether crowdsourced lay raters can accurately assess cataract surgical skills.
Two-armed study: independent cross-sectional and longitudinal cohorts.
Washington University Department of Ophthalmology.
Sixteen cataract surgeons with varying experience levels submitted cataract surgery videos to be graded by 5 experts and 300+ crowdworkers masked to surgeon experience. Cross-sectional study: 50 videos from surgeons ranging from first-year resident to attending physician, pooled by years of training. Longitudinal study: 28 videos obtained at regular intervals as residents progressed through 180 cases. Surgical skill was graded using the modified Objective Structured Assessment of Technical Skill (mOSATS). Main outcome measures were overall technical performance, reliability indices, and correlation between expert and crowd mean scores.
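To make the outcome measures concrete, here is a minimal Python sketch of how interrater reliability and the expert-crowd correlation could be computed from a videos-by-raters matrix of mOSATS totals. The rater counts, score scale, and data below are hypothetical placeholders, and Cronbach's alpha is used only as one possible reliability index, since the abstract does not name the specific statistic used.

    import numpy as np
    from scipy.stats import pearsonr

    def cronbach_alpha(ratings):
        # ratings: videos x raters matrix of mOSATS totals.
        ratings = np.asarray(ratings, dtype=float)
        k = ratings.shape[1]                      # number of raters
        item_var = ratings.var(axis=0, ddof=1)    # each rater's score variance
        total_var = ratings.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1.0 - item_var.sum() / total_var)

    # Placeholder score matrices (rows = videos, columns = raters); values are
    # random stand-ins on an arbitrary scale, not data from the study.
    rng = np.random.default_rng(0)
    expert_scores = rng.uniform(1, 5, size=(50, 5))
    crowd_scores = rng.uniform(1, 5, size=(50, 40))

    print("expert interrater reliability (Cronbach's alpha):",
          round(cronbach_alpha(expert_scores), 3))

    # Correlation between per-video expert and crowd mean scores.
    r, p = pearsonr(expert_scores.mean(axis=1), crowd_scores.mean(axis=1))
    print(f"expert vs. crowd mean score correlation: r = {r:.3f}, p = {p:.4g}")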
Experts demonstrated high interrater reliability and accurately predicted training level, establishing construct validity for the modified OSATS. Crowd scores correlated with expert scores (r = 0.865, p < 0.0001) but were consistently higher for first-, second-, and third-year residents (p < 0.0001, paired t-test). Surgery duration was negatively correlated with training level (r = -0.855, p < 0.0001) and with expert score (r = -0.927, p < 0.0001). The longitudinal dataset reproduced the cross-sectional findings for the crowd and expert comparisons. A regression equation transforming crowd score plus video length into expert score was derived from the cross-sectional dataset (r = 0.92) and showed strong predictive performance when applied to the independent longitudinal dataset (r = 0.80). A group of student raters who had edited the cataract videos also graded them, producing scores that approximated the expert scores more closely than the crowd's did.
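The duration-adjusted model described above can be sketched as an ordinary least-squares regression of expert score on crowd score and surgery duration. The values and fitted coefficients below are illustrative only; the paper's actual regression equation and data are not reproduced here.

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical per-video values (illustrative only, not the study's data):
    # mean crowd mOSATS score, surgery duration in minutes, mean expert mOSATS score.
    crowd_mean   = np.array([4.2, 3.5, 4.6, 3.0, 4.4, 3.7, 4.8, 3.2])
    duration_min = np.array([22.0, 35.0, 18.0, 41.0, 20.0, 30.0, 15.0, 38.0])
    expert_mean  = np.array([3.6, 2.6, 4.2, 2.0, 3.9, 2.9, 4.5, 2.3])

    # Ordinary least squares: expert_mean ~ intercept + crowd_mean + duration_min.
    X = np.column_stack([np.ones_like(crowd_mean), crowd_mean, duration_min])
    coef, *_ = np.linalg.lstsq(X, expert_mean, rcond=None)
    b0, b_crowd, b_dur = coef
    print(f"expert ~= {b0:.2f} + {b_crowd:.2f}*crowd + {b_dur:.2f}*duration")

    # In-sample fit; the study validates its model on the independent longitudinal cohort.
    predicted = X @ coef
    r, _ = pearsonr(predicted, expert_mean)
    print(f"predicted vs. observed expert score: r = {r:.3f}")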
Crowdsourced rankings correlated with expert scores but were not equivalent; crowd scores overestimated technical competency, especially for novice surgeons. A novel approach of adjusting crowd scores for surgery duration yielded a more accurate predictive model of surgical skill. More studies are needed before crowdsourcing can be reliably used to assess surgical proficiency.