Yao Mengrui, Zhang Wenjie, Wang Lin, Zhao Zhongwei, Jia Xiao
School of Control Science and Engineering, Shandong University, Jinan 250100, China.
Department of Urology, Qilu Hospital of Shandong University, Shandong University, Jinan 250100, China.
Sensors (Basel). 2025 Aug 26;25(17):5306. doi: 10.3390/s25175306.
Artificial intelligence has shown great promise in advancing intelligent surgical systems. Among its applications, surgical video action recognition plays a critical role in enabling accurate intraoperative understanding and decision support. However, the task remains challenging due to the temporal continuity of surgical scenes and the long-tailed, semantically entangled distribution of action triplets composed of instruments, verbs, and targets. To address these issues, we propose TriQuery, a query-based model for surgical triplet recognition and classification. Built on a multi-task Transformer framework, TriQuery decomposes the complex triplet task into three semantically aligned subtasks using task-specific query tokens, which are processed through specialized attention mechanisms. We introduce a Multi-Query Decoding Head (MQ-DH) to jointly model structured subtasks and a Top-K Guided Query Update (TKQ) module to incorporate inter-frame temporal cues. Experiments on the CholecT45 dataset demonstrate that TriQuery achieves improved overall performance over existing baselines across multiple classification tasks. Attention visualizations further show that task queries consistently attend to semantically relevant spatial regions, enhancing model interpretability. These results highlight the effectiveness of TriQuery for advancing surgical video understanding in clinical environments.
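The core decoding idea described above — task-specific query tokens that cross-attend to frame features and feed separate classification heads for instruments, verbs, and targets — can be illustrated with a minimal sketch. This is an assumption-laden toy in NumPy, not the paper's implementation: the shapes, variable names, and single-layer attention are illustrative only, while the per-subtask class counts (6 instruments, 10 verbs, 15 targets) follow the CholecT45 dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    """Scaled dot-product attention: queries (T, d) attend over features (N, d)."""
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)   # (T, N) similarity scores
    weights = softmax(scores, axis=-1)           # attention over spatial tokens
    return weights @ features, weights           # attended queries, attention maps

d_model = 32
n_tokens = 49                                    # e.g. a 7x7 feature map, flattened
frame_feats = rng.standard_normal((n_tokens, d_model))

# One learned query token per subtask; class counts per CholecT45 components.
task_queries = rng.standard_normal((3, d_model))
heads = {
    "instrument": rng.standard_normal((d_model, 6)),
    "verb":       rng.standard_normal((d_model, 10)),
    "target":     rng.standard_normal((d_model, 15)),
}

attended, attn = cross_attention(task_queries, frame_feats)
logits = {name: attended[i] @ W for i, (name, W) in enumerate(heads.items())}

for name, z in logits.items():
    print(name, z.shape)
```

The attention weights `attn` (one row per task query over the spatial tokens) are what the paper's visualizations inspect: each query's row shows which image regions that subtask attends to.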