Bejan Cosmin A, Reed Amy M, Mikula Matthew, Zhang Siwei, Xu Yaomin, Fabbri Daniel, Embí Peter J, Hsi Ryan S
Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center, 2525 West End Avenue, Suite 1500, Nashville, TN, 37232, USA.
Department of Urology, Vanderbilt University Medical Center, Nashville, USA.
Sci Rep. 2025 Jan 28;15(1):3503. doi: 10.1038/s41598-025-86632-5.
Recent advancements in large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest in the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine whether the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance LLM performance, including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate potential disparities exhibited by LLMs with respect to race and gender. A clinical expert manually assessed the explanations generated by GPT-4 for its predictions to determine whether they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results were achieved by GPT-4 (macro-F1 = 0.833, 95% confidence interval [CI] 0.826-0.841) and GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796-0.796). Ablation studies revealed that the initial pre-trained GPT-3.5 model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts enabled the LLMs to make better decisions. Bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity.
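The few-shot prompting strategy mentioned in the abstract can be illustrated with a minimal sketch: labeled ED report excerpts are prepended to the query report so the model classifies the visit as stone-related or not. The `build_prompt` helper, the instruction wording, and the toy report snippets below are all illustrative assumptions, not the study's actual prompts.

```python
# Hypothetical few-shot prompt assembly for the binary classification task
# (symptomatic kidney stone visit: yes/no). The resulting string would be
# sent to an LLM API; the example reports and labels are invented.

def build_prompt(examples, query_report):
    """Assemble a few-shot classification prompt from labeled examples."""
    instruction = (
        "Decide whether the following emergency department visit was due to "
        "symptomatic kidney stones. Answer 'yes' or 'no'."
    )
    shots = "\n\n".join(
        f"Report: {text}\nAnswer: {label}" for text, label in examples
    )
    return f"{instruction}\n\n{shots}\n\nReport: {query_report}\nAnswer:"

examples = [
    ("Flank pain radiating to groin; CT shows 4 mm ureteral calculus.", "yes"),
    ("Chest pain with elevated troponin; no urinary complaints.", "no"),
]
prompt = build_prompt(examples, "Acute left flank pain with hematuria.")
print(prompt)
```

Prompt augmentation, as described in the abstract, would then correspond to appending structured context (e.g., demographics or prior disease history) to `query_report` before assembly.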
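The reported evaluation metric, macro-F1 with a 95% bootstrap confidence interval, can be sketched in pure Python as follows. The toy labels, the helper names, and the choice of a percentile bootstrap are illustrative assumptions; the paper's exact resampling procedure is not specified in the abstract.

```python
# Minimal sketch: macro-averaged F1 plus a 95% percentile bootstrap CI
# over paired (gold, predicted) labels. Toy data only.
import random

def f1(y_true, y_pred, positive):
    """F1 for one class treated as the positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1(y_true, y_pred, c) for c in labels) / len(labels)

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for macro-F1 over resampled index sets."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(macro_f1([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = stone-related visit (toy labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
score = macro_f1(y_true, y_pred)
ci = bootstrap_ci(y_true, y_pred)
```

A degenerate interval such as the 0.796-0.796 CI reported for GPT-3.5 arises when every bootstrap resample yields the same score, e.g., under deterministic predictions on a homogeneous sample.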