University of Washington, School of Medicine, Seattle, Washington.
Department of Bioengineering, University of Washington, Seattle, Washington.
J Surg Res. 2014 Mar;187(1):65-71. doi: 10.1016/j.jss.2013.09.024. Epub 2013 Oct 10.
Validated methods for the objective assessment of surgical skill are resource intensive. We sought to test a web-based, crowdsourced grading tool called Crowd-Sourced Assessment of Technical Skill.
Institutional Review Board approval was granted to test the accuracy of Amazon.com Mechanical Turk and Facebook crowdworkers, compared with experienced surgical faculty, in grading a recorded dry-laboratory robotic surgical suturing performance on three performance domains from a validated assessment tool. Assessors' free-text comments describing their rating rationale were used to explore the relationship between the language used by the crowd and grading accuracy.
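The abstract does not describe the crowd data-collection pipeline in detail. As a rough, hypothetical illustration of how such a rating task might be posted to Mechanical Turk, the Python sketch below uses boto3; the survey URL, reward, and task parameters are placeholders, not the study's actual settings.

```python
# A minimal sketch (not the authors' pipeline) of posting a video-rating task to
# Amazon Mechanical Turk with boto3. URL, reward, and counts are hypothetical.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint; omit endpoint_url to post to the production marketplace.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# ExternalQuestion embeds a self-hosted survey page (hypothetical URL) that shows
# the suturing video and collects three 1-5 domain ratings plus a free-text comment.
external_question = """<ExternalQuestion
  xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/suturing-survey</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Rate a short robotic suturing video",
    Description="Watch a 3-minute surgical training video and rate three performance domains.",
    Keywords="video, rating, survey, surgery",
    Reward="0.25",                    # USD per assignment (hypothetical)
    MaxAssignments=500,               # roughly the number of crowd responses sought
    LifetimeInSeconds=24 * 60 * 60,   # keep the task open for one day
    AssignmentDurationInSeconds=15 * 60,
    Question=external_question,
)
print("HITId:", hit["HIT"]["HITId"])
```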
On a global performance scale of 3-15 (three domains scored 1-5 each), 10 experienced surgeons graded the suturing video at a mean score of 12.11 (95% confidence interval [CI], 11.11-13.11). Mechanical Turk and Facebook graders rated the video at mean scores of 12.21 (95% CI, 11.98-12.43) and 12.06 (95% CI, 11.57-12.55), respectively. Responses from 501 Mechanical Turk subjects were obtained within 24 h, whereas the 10 faculty surgeons took 24 d to complete the 3-min survey; the 110 Facebook subjects responded within 25 d. Language analysis indicated that crowdworkers who used negation words (e.g., "but" and "although") rated the performance closer to the experienced surgeons' scores than crowdworkers who did not (P < 0.00001).
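The exact statistical procedures are not given in the abstract. The sketch below shows one plausible way to compute a group mean with its 95% confidence interval and to compare rating accuracy between crowdworkers who did and did not use negation words; the negation word list, the accuracy metric (absolute deviation from the expert mean), and the Mann-Whitney U test are assumptions for illustration.

```python
# A minimal sketch, not the paper's analysis code. Data, the negation word list,
# and the choice of Mann-Whitney U test are assumptions.
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Mean and t-based confidence interval for a list of global scores (3-15)."""
    scores = np.asarray(scores, dtype=float)
    m = scores.mean()
    half = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return m, (m - half, m + half)

NEGATION_WORDS = {"but", "although", "however", "though"}  # illustrative list only

def uses_negation(comment):
    return any(w.strip(".,;!?") in NEGATION_WORDS for w in comment.lower().split())

def compare_negation_groups(records, expert_mean):
    """records: (total_score, free_text_comment) pairs, one per crowdworker.
    Compares absolute deviation from the expert mean between the two groups."""
    with_neg = [abs(s - expert_mean) for s, c in records if uses_negation(c)]
    without = [abs(s - expert_mean) for s, c in records if not uses_negation(c)]
    return stats.mannwhitneyu(with_neg, without, alternative="two-sided")
```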
For a robotic suturing performance, we have shown that surgery-naive crowdworkers using Crowd-Sourced Assessment of Technical Skill can rapidly provide skill assessments equivalent to those of experienced faculty surgeons. It remains to be seen whether crowds can discriminate between different levels of skill and accurately assess human surgical performances.