Anand Malpani, S. Swaroop Vedula, Chi Chiung Grace Chen, Gregory D. Hager
Johns Hopkins University, 3400 N Charles St, Hackerman Hall Room 200, Baltimore, MD, USA.
Int J Comput Assist Radiol Surg. 2015 Sep;10(9):1435-47. doi: 10.1007/s11548-015-1238-6. Epub 2015 Jun 30.
Currently available methods for surgical skills assessment are either subjective or provide only global evaluations of the overall task. Such global evaluations do not tell trainees where in the task they need to perform better. In this study, we investigated the reliability and validity of a framework that generates objective skill assessments for segments within a task, and we compared assessments from our framework, computed using crowdsourced segment ratings from surgically untrained individuals and from expert surgeons, against manually assigned global rating scores.
Our framework includes (1) training a binary classifier to generate preferences for pairs of task segments (i.e., given a pair of segments, specifying which one was performed better), (2) computing segment-level percentile scores from these preferences, and (3) predicting task-level scores from the segment-level scores. We conducted a crowdsourcing user study to obtain manual preferences for segments within a suturing and knot-tying task from a crowd of surgically untrained individuals and from a group of expert surgeons. We analyzed the inter-rater reliability of the preferences obtained from the crowd and from the experts, and we investigated the validity of the task-level scores obtained using our framework. In addition, we compared the accuracy of the crowd and expert preference classifiers, as well as the segment- and task-level scores obtained from the two classifiers.
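To make the three steps concrete, the following is a minimal sketch in Python; it is not the authors' implementation, and the feature representation, the choice of logistic and linear regression, and all function and variable names are illustrative assumptions.

```python
# Minimal sketch of the three-stage framework (illustrative assumptions only:
# feature vectors per segment, logistic regression for pairwise preferences,
# linear regression for task-level scores).
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression, LinearRegression


def train_preference_classifier(segment_features, preferences):
    """Step 1: binary classifier over pairs of segments.

    segment_features: dict mapping segment id -> 1-D feature vector.
    preferences: list of (seg_a, seg_b, label) tuples, with label = 1 if
    seg_a was rated as performed better than seg_b, else 0.
    """
    X = [np.concatenate([segment_features[a], segment_features[b]])
         for a, b, _ in preferences]
    y = [label for _, _, label in preferences]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf


def segment_percentile_scores(clf, segment_features):
    """Step 2: score each segment by the fraction of other segments it is
    predicted to outperform (a percentile-style score in [0, 100])."""
    ids = list(segment_features)
    wins = {s: 0 for s in ids}
    for a, b in combinations(ids, 2):
        pair = np.concatenate([segment_features[a],
                               segment_features[b]]).reshape(1, -1)
        if clf.predict(pair)[0] == 1:   # segment a preferred over segment b
            wins[a] += 1
        else:
            wins[b] += 1
    return {s: 100.0 * w / (len(ids) - 1) for s, w in wins.items()}


def fit_task_score_model(per_trial_segment_scores, task_grs):
    """Step 3: regress the task-level global rating score (GRS, 6-30) on the
    vector of segment-level scores for each trial (the same number of
    segments per trial is assumed here)."""
    reg = LinearRegression()
    reg.fit(np.asarray(per_trial_segment_scores), np.asarray(task_grs))
    return reg
```

In this sketch, a segment's percentile score is simply the share of other segments it is predicted to outperform, and the task-level GRS is regressed on the vector of segment-level scores; the published framework may differ in both choices.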
We observed moderate inter-rater reliability within the crowd (Fleiss' kappa, κ = 0.41) and within the experts (κ = 0.55). For both the crowd and the experts, the accuracy of an automated classifier trained using all the task segments exceeded the corresponding inter-rater agreement [crowd classifier 85% (SE 2%), expert classifier 89% (SE 3%)]. We predicted the overall global rating scores (GRS) for the task with a root-mean-squared error lower than one standard deviation of the ground-truth GRS. We observed a high correlation between the segment-level scores obtained using the crowd and expert preference classifiers (ρ ≥ 0.86). The task-level scores obtained using the crowd and expert preference classifiers were also highly correlated with each other (ρ ≥ 0.84) and were statistically equivalent within a margin of two points (on a scale ranging from 6 to 30). Our analyses, however, did not demonstrate statistically significant equivalence in accuracy between the crowd and expert classifiers within a 10% margin.
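As a companion to the metrics reported above, the following minimal sketch shows how Fleiss' kappa and the RMSE validity check can be computed; the rating matrix and GRS arrays are toy stand-ins, not the study data.

```python
# Minimal sketch of the reliability and validity checks reported above.
# The rating matrix and GRS arrays below are toy stand-ins, not study data.
import numpy as np


def fleiss_kappa(counts):
    """counts: (n_items, n_categories) array; counts[i, j] = number of raters
    who chose category j (e.g., 'first segment performed better') for item i.
    Assumes the same number of raters for every item."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)        # category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)


def rmse(predicted, truth):
    predicted, truth = np.asarray(predicted, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((predicted - truth) ** 2)))


# Toy example: 4 segment pairs, 3 raters each, 2 categories (which was better).
ratings = [[3, 0], [2, 1], [1, 2], [3, 0]]
print(fleiss_kappa(ratings))

# Validity criterion used above: RMSE of predicted GRS should fall below one
# standard deviation of the ground-truth GRS.
pred_grs, true_grs = [18.0, 24.5, 12.0], [20.0, 25.0, 11.0]
print(rmse(pred_grs, true_grs) < np.std(true_grs))
```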
Our framework, implemented using crowdsourced pairwise comparisons, yields valid objective surgical skill assessments both for segments within a task and for the task overall. Crowdsourcing efficiently provides reliable pairwise comparisons of skill for segments within a task. Our framework may be deployed within surgical training programs for objective, automated, and standardized evaluation of technical skills.