Stueker Esther Helene, Kolbinger Fiona R, Saldanha Oliver Lester, Digomann David, Pistorius Steffen, Oehme Florian, Van Treeck Marko, Ferber Dyke, Löffler Chiara Maria Lavinia, Weitz Jürgen, Distler Marius, Kather Jakob Nikolas, Muti Hannah Sophie
Else Kröner Fresenius Center for Digital Health, Dresden University of Technology, Dresden, Germany.
Department for Visceral, Thoracic and Vascular Surgery, University Hospital and Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany.
Int J Surg. 2025 Jul 17. doi: 10.1097/JS9.0000000000003069.
The ongoing shortage of medical personnel highlights the urgent need to automate clinical documentation and reduce administrative burden. Large Vision-Language Models (VLMs) show promise for supporting surgical documentation and intraoperative analysis.
We conducted an observational, comparative performance study of two general-purpose VLMs, GPT-4o (OpenAI) and Gemini-1.5-pro (Google), from June to September 2024, using 15 cholecystectomy and 15 appendectomy videos (sampled at 1 frame per second) from the CholecT45 and LapApp datasets. Tasks included object detection (vessel clips, gauze, retrieval bags, bleeding), surgery type classification, appendicitis grading, and surgical report generation. In-context learning (ICL) was evaluated as an enhancement method. Performance was assessed using descriptive accuracy metrics.
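The abstract does not specify the implementation, but the workflow it describes (1 fps frame sampling, per-frame VLM queries, optional in-context examples) can be illustrated with a minimal Python sketch. It assumes the publicly documented OpenAI Python client and OpenCV; the file name, prompt wording, and helper functions are hypothetical illustrations, not the authors' pipeline.

```python
import base64
import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_frames_1fps(video_path: str) -> list[str]:
    """Sample one frame per second and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25  # fall back if FPS metadata is missing
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(round(fps)) == 0:  # keep roughly one frame per second
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(base64.b64encode(buf).decode("utf-8"))
        idx += 1
    cap.release()
    return frames

def detect_object(frame_b64: str, question: str, icl_examples=()) -> str:
    """Query GPT-4o about one frame; icl_examples are (b64_image, answer) pairs."""
    content = []
    for ex_img, ex_answer in icl_examples:  # optional in-context examples (ICL)
        content += [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{ex_img}"}},
            {"type": "text", "text": f"Example answer: {ex_answer}"},
        ]
    content += [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        {"type": "text", "text": question},
    ]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

# Hypothetical usage on one video and one detection task
frames = sample_frames_1fps("cholecystectomy_01.mp4")
print(detect_object(frames[0],
                    "Is a vessel clip visible in this frame? Answer yes or no."))
```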
Both models identified vessel clips with 100% accuracy. GPT-4o outperformed Gemini-1.5-pro in retrieval bag detection (100% vs. 93.3%) and gauze detection (93.3% vs. 60%), while Gemini-1.5-pro performed better in bleeding detection (93.3% vs. 86.7%). In surgery type classification, Gemini-1.5-pro was more accurate for cholecystectomies (93% vs. 80%), and both models reached 60% accuracy for appendectomies. Appendicitis grading showed limited performance (GPT-4o: 40%, Gemini-1.5-pro: 26.7%). For surgical reports, GPT-4o produced more complete outputs (cholecystectomy [CCE]: 90.4%, appendectomy [APE]: 80.1%), while Gemini-1.5-pro achieved higher correctness overall (CCE: 71.1%, APE: 69.6%). ICL notably improved tool recognition (e.g., in APE step 4, GPT-4o improved from 69.2% to 80%), though its effect on organ-removal step recognition was inconsistent.
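As context for the percentages above: with 15 videos per procedure, the reported descriptive accuracies correspond to simple correct-over-total fractions (e.g., 14/15 = 93.3%). A minimal sketch with hypothetical labels, not the authors' evaluation code:

```python
def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Descriptive accuracy: fraction of predictions matching the ground truth."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Hypothetical example: 14 of 15 videos answered correctly
preds = ["yes"] * 14 + ["no"]
truth = ["yes"] * 15
print(f"{accuracy(preds, truth):.1%}")  # 93.3%
```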
GPT-4o and Gemini-1.5-pro performed reliably in object detection and procedure classification but showed limitations in grading pathology and in accurately describing procedural steps; in-context learning partly mitigated these limitations. This shows that domain-agnostic VLMs can be applied to surgical video analysis. In the future, VLMs equipped with surgical domain knowledge could serve as companions in the operating room.