Ramamurthi Adhitya, Neupane Bhabishya, Deshpande Priya, Hanson Ryan, Vegesna Srujan, Cray Deborah, Crotty Bradley H, Somai Melek, Brown Kellie R, Pawar Sachin S, Taylor Bradley, Kothari Anai N
Selig Hub for Surgical Data Science, Medical College of Wisconsin, Milwaukee.
Department of Surgery, Medical College of Wisconsin, Milwaukee.
JAMA Surg. 2025 Jul 9. doi: 10.1001/jamasurg.2025.2154.
IMPORTANCE: Accurate prediction of surgical case duration is critical for operating room (OR) management, as inefficient scheduling can lead to reduced patient and surgeon satisfaction while incurring considerable financial costs.
OBJECTIVE: To evaluate the feasibility and accuracy of large language models (LLMs) in predicting surgical case length using unstructured clinical data compared with existing estimation methods.
DESIGN, SETTING, AND PARTICIPANTS: This was a retrospective study analyzing elective surgical cases performed between January 2017 and December 2023 at a single academic medical center and affiliated community hospital ORs. Analysis included 125 493 eligible surgical cases, with 1950 used for LLM fine-tuning and 2500 for evaluation. An additional 500 cases from a community site were used for external validation. Cases were randomly sampled using strata to ensure representation across surgical specialties.
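The stratified sampling step can be illustrated with a minimal sketch. The abstract does not describe the sampling procedure beyond "randomly sampled using strata", so the proportional allocation, the one-case-per-stratum floor, the function name `stratified_sample`, and the toy `specialty` field are all assumptions made for illustration, not the authors' method.

```python
import random
from collections import defaultdict

def stratified_sample(cases, key, n_total, seed=0):
    """Draw roughly n_total cases, allocating draws to each stratum
    in proportion to its share of the full case list (an assumption)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[key(case)].append(case)

    sample = []
    for members in strata.values():
        # Keep at least one case per stratum so small specialties still appear.
        k = max(1, round(n_total * len(members) / len(cases)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical toy data: 100 cases spread over three specialties.
specialties = ["ortho"] * 60 + ["vascular"] * 30 + ["ENT"] * 10
cases = [{"id": i, "specialty": s} for i, s in enumerate(specialties)]

picked = stratified_sample(cases, key=lambda c: c["specialty"], n_total=20)
```

Because each stratum's allocation is rounded separately, the total can differ slightly from `n_total` when the proportions are not whole numbers.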
EXPOSURES: Eleven LLMs, including base models (GPT-4, GPT-3.5, Mistral, Llama-3, Phi-3) and 2 fine-tuned variants (GPT-4 fine-tuned, GPT-3.5 fine-tuned), were used to predict surgical case length based on clinical notes.
MAIN OUTCOMES AND MEASURES: The primary outcome was average error between predicted and actual surgical case length (wheels-in to wheels-out time). The secondary outcome was prediction accuracy, defined as predicted length within 20% of actual duration.
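The two outcome measures can be sketched as follows. This assumes "average error" is the mean absolute error reported in the results, and that a prediction exactly 20% off still counts as accurate; the abstract does not say whether the boundary is inclusive. The function names and the four example cases are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| across cases, in minutes."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def accuracy_within(predicted, actual, tolerance=0.20):
    """Share of cases whose prediction falls within `tolerance`
    (as a fraction of the actual duration) of the actual duration."""
    hits = sum(abs(p - a) <= tolerance * a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical case lengths in minutes (wheels-in to wheels-out).
predicted = [100, 130, 200, 45]
actual = [120, 125, 150, 60]

mae = mean_absolute_error(predicted, actual)   # 22.5
accuracy = accuracy_within(predicted, actual)  # 0.5
```

The tolerance scales with the actual duration, so a 15-minute miss counts as inaccurate for a 60-minute case but would be accurate for a 120-minute one.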
RESULTS: Fine-tuned GPT-4 achieved the best performance, with a mean absolute error (MAE) of 47.64 minutes (95% CI, 45.71-49.56) and R² of 0.61, which was not significantly different from current OR scheduling (MAE, 49.34 minutes; 95% CI, 47.60-51.09; R², 0.63; P = .10). Both GPT-4 fine-tuned and GPT-3.5 fine-tuned significantly outperformed current scheduling methods in accuracy (46.12% and 46.08%, respectively, vs 40.92%; P < .001). GPT-4 fine-tuned outperformed all other models during external validation, with similar performance metrics (MAE, 48.66 minutes; 95% CI, 45.31-52.00; accuracy, 46.0%). Base models demonstrated variable performance, with GPT-4 showing the highest performance among non-fine-tuned models (MAE, 59.20 minutes; 95% CI, 56.88-61.52).
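One common way to attach a 95% CI to an MAE is a normal approximation on the per-case absolute errors, sketched below. The abstract does not state how its intervals were computed, so this method, the helper name `mae_with_ci`, and the example values are assumptions for illustration only.

```python
import math
import statistics

def mae_with_ci(predicted, actual, z=1.96):
    """Return (MAE, lower, upper) using a normal-approximation 95% CI
    on the per-case absolute errors (an assumed method)."""
    abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]
    mae = statistics.mean(abs_errors)
    # Standard error of the mean absolute error.
    se = statistics.stdev(abs_errors) / math.sqrt(len(abs_errors))
    return mae, mae - z * se, mae + z * se

# Hypothetical case lengths in minutes; absolute errors are 10, 20, 30, 40.
predicted = [110, 140, 180, 100]
actual = [100, 120, 150, 60]

mae, lower, upper = mae_with_ci(predicted, actual)
```

With only a handful of cases the normal approximation is rough; it is more defensible at the sample sizes reported here (2500 evaluation cases), where a bootstrap interval would be a reasonable alternative.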
CONCLUSIONS AND RELEVANCE: The findings in this study suggest that fine-tuned LLMs can predict surgical case length with accuracy comparable to or exceeding current institutional scheduling methods. This indicates potential for LLMs to enhance operating room efficiency through improved case length prediction using existing clinical documentation.