Gazis Athanasios, Schizas Dimitrios, Kykalos Stylianos, Karaiskos Pantelis, Loukas Constantinos
Laboratory of Medical Physics, Medical School, National and Kapodistrian University of Athens, Mikras Asias 75, 11527, Athens, Attiki, Greece.
1st Department of Surgery, Laikon General Hospital, National and Kapodistrian University of Athens, Agiou Thoma 17, 11527, Athens, Attiki, Greece.
Int J Comput Assist Radiol Surg. 2025 Sep 18. doi: 10.1007/s11548-025-03518-7.
While significant progress has been made in skill assessment for minimally invasive procedures, objective evaluation methods for open surgery remain limited. This paper presents a deep learning framework for assessing technical surgical skills using egocentric video data from open surgery training.
Our dataset comprises 201 videos and corresponding hand kinematics data from three fundamental training tasks: knot tying (KT), continuous suturing (CS), and interrupted suturing (IS), performed by 20 participants. Each video was annotated by two experts using a modified OSATS scale (KT: five criteria, total score range 5-25; CS/IS: seven criteria, total score range 7-35). We evaluate three temporal architectures (LSTM, TCN, and Transformer), each using ResNet50 as the backbone for spatial feature extraction, and assess them under several training strategies: single-task learning, feature concatenation, pretraining, and multi-task learning with integrated kinematic data. Performance metrics were the mean absolute error (MAE) and Spearman's rank correlation coefficient (ρ), both computed with respect to total score prediction.
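The two metrics above are standard for score regression; as a minimal pure-Python sketch (the paper's exact evaluation code is not given here, and this version omits tie correction in the rank computation for brevity), they can be computed as:

```python
def mean_absolute_error(predicted, actual):
    """MAE between predicted and expert total scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

def spearman_rho(predicted, actual):
    """Spearman rank correlation coefficient (no tie correction)."""
    def ranks(values):
        # Rank 1 = smallest value; assumes no ties.
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    n = len(predicted)
    d2 = sum((rp - ra) ** 2
             for rp, ra in zip(ranks(predicted), ranks(actual)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative OSATS-style total scores (hypothetical values):
mae = mean_absolute_error([20, 15, 25, 10], [21, 14, 24, 12])   # 1.25
rho = spearman_rho([20, 15, 25, 10], [21, 14, 24, 12])          # 1.0
```

In practice a library routine such as `scipy.stats.spearmanr` would be used, which also handles ties.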
The Transformer-based models consistently outperformed LSTM and TCN across all tasks. The multi-task Transformer incorporating prediction of task completion time achieved the lowest MAE (KT: 1.92, CS: 2.81, IS: 2.89) and ρ = 0.84-0.90. It also demonstrated promising capabilities for early skill assessment, predicting the total score from partial observations, particularly for the simpler tasks. Additionally, we show that models trained on consensus expert ratings outperform those trained on individual annotations, highlighting the value of multi-rater ground truth.
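The abstract does not state the exact multi-task objective; a common formulation sums a score-regression loss with a weighted auxiliary loss on completion time, as in this minimal sketch (the weight value is hypothetical):

```python
def multitask_loss(score_pred, score_true, time_pred, time_true,
                   aux_weight=0.5):
    """Joint MSE loss: total-score regression plus an auxiliary
    completion-time regression term, scaled by aux_weight
    (a hypothetical hyperparameter, not from the paper)."""
    def mse(pred, true):
        return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

    return mse(score_pred, score_true) + aux_weight * mse(time_pred, time_true)
```

The auxiliary target acts as a regularizer: completion time correlates with skill, so sharing a representation for both predictions can improve score regression.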
This research provides a foundation for objective, automated assessment of open surgical skills, with potential to improve the efficiency and standardization of surgical training.