Haider Syed Ali, Ho Olivia A, Borna Sahar, Gomez-Cabello Cesar A, Pressman Sophia M, Cole Dave, Sehgal Ajai, Leibovich Bradley C, Forte Antonio Jorge
Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA.
Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA.
Bioengineering (Basel). 2025 Jan 15;12(1):72. doi: 10.3390/bioengineering12010072.
Accurate identification of surgical instruments is crucial for efficient workflows and patient safety in the operating room, particularly for preventing complications such as retained surgical instruments. Artificial intelligence (AI) models have shown the potential to automate this process. This study evaluates the accuracy of three publicly available large language models (LLMs), ChatGPT-4, ChatGPT-4o, and Gemini, and of a specialized commercial mobile application, Surgical-Instrument Directory (SID 2.0), in identifying surgical instruments from images. The dataset comprised 92 high-resolution images of 25 surgical instruments (retractors, forceps, scissors, and trocars) photographed from multiple angles. Model performance was evaluated using accuracy, weighted precision, recall, and F1 score. In categorizing instruments (e.g., scissors, forceps), ChatGPT-4o achieved the highest accuracy (89.1%). SID 2.0 (77.2%) and ChatGPT-4 (76.1%) achieved comparable accuracy, while Gemini (44.6%) performed markedly worse. For precise subtype identification of instrument names (such as "Mayo scissors" or "Kelly forceps"), all models had low accuracy; SID 2.0 was highest at 39.1%, followed by ChatGPT-4o (33.69%). Subgroup analysis revealed that ChatGPT-4 and ChatGPT-4o recognized trocars in all instances. Similarly, Gemini identified surgical scissors in all instances. In conclusion, publicly available LLMs can reliably identify surgical instruments at the category level, with ChatGPT-4o demonstrating an overall edge. However, precise subtype identification remains a challenge for all models. These findings highlight the potential of AI-driven solutions to enhance surgical-instrument management and underscore the need for further refinement to improve accuracy and support patient safety.
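The abstract names its evaluation metrics (accuracy, weighted precision, recall, and F1 score) without defining them. As an illustration only, the sketch below assumes the usual support-weighted definition, in which each per-class score is weighted by the number of true examples of that class; the abstract does not say which implementation the authors used. The function name `weighted_metrics` and the six toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1.

    Assumption: "weighted" means each class score is weighted by its
    share of the true labels. This is the common convention, not a
    definition confirmed by the abstract.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        # A class that is never predicted gets precision 0 by convention.
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels for illustration; not results from the study.
y_true = ["scissors", "scissors", "forceps", "forceps", "retractor", "trocar"]
y_pred = ["scissors", "forceps", "forceps", "forceps", "retractor", "trocar"]
metrics = weighted_metrics(y_true, y_pred)
```

Under this definition, weighted recall always equals overall accuracy, because each class recall is multiplied by that class's share of the true labels.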