Bejan Cosmin A, Reed Amy M, Mikula Matthew, Zhang Siwei, Xu Yaomin, Fabbri Daniel, Embí Peter J, Hsi Ryan S
medRxiv. 2024 Aug 13:2024.08.12.24311870. doi: 10.1101/2024.08.12.24311870.
Recent advances in large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4) have generated significant interest in the scientific community. Yet the potential of these models in clinical settings remains largely unexplored. This study investigated the ability of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine whether the corresponding visits were caused by symptomatic kidney stones.
Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance the performance of GPT-4, GPT-3.5, and Llama-2, including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate potential disparities in these LLMs' predictions with respect to race and gender. A clinical expert manually assessed the explanations GPT-4 generated for its predictions to determine whether they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The evaluation included comparisons between the LLMs, traditional machine learning models (logistic regression, extreme gradient boosting, and light gradient boosting machine), and a baseline system that used International Classification of Diseases (ICD) codes for kidney stones.
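As a concrete illustration of the few-shot prompting strategy described above, the following is a minimal sketch assuming the OpenAI Python client; the system prompt, label scheme, few-shot example, and the classify_report helper are hypothetical placeholders, not the authors' actual prompts or pipeline.

```python
# Sketch of zero-/few-shot classification of an ED report, assuming the
# OpenAI chat completions API. Prompt wording and the few-shot example
# are illustrative, not taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a clinical NLP assistant. Given an emergency department "
    "report, answer YES if the visit was caused by symptomatic kidney "
    "stones and NO otherwise, then briefly explain your reasoning."
)

# One hypothetical few-shot example; the study's annotated examples differ.
FEW_SHOT = [
    {"role": "user", "content": "Report: Patient presents with left flank "
     "pain radiating to the groin; CT shows a 4 mm ureteral calculus."},
    {"role": "assistant", "content": "YES - flank pain with an obstructing "
     "ureteral stone on CT indicates a symptomatic kidney stone visit."},
]

def classify_report(report_text: str) -> str:
    """Classify one ED report with a few-shot prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for classification
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  *FEW_SHOT,
                  {"role": "user", "content": f"Report: {report_text}"}],
    )
    return response.choices[0].message.content
```

Prompt augmentation, as described in the abstract, would extend the user message with demographic information and prior disease history before calling the model.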
The best results were achieved by GPT-4 (macro-F1=0.833, 95% confidence interval [CI]=0.826-0.841) and GPT-3.5 (macro-F1=0.796, 95% CI=0.796-0.796), both statistically significantly better than the ICD-based baseline (macro-F1=0.71). Ablation studies revealed that the initial pre-trained GPT-3.5 model benefited from fine-tuning under the same parameter configuration. Adding demographic information and prior disease history to the prompts allowed the LLMs to make more accurate decisions. The bias evaluation found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity. The analysis of the explanations provided by GPT-4 demonstrated this model's advanced capabilities in understanding clinical text and reasoning with medical knowledge.
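The following is a minimal sketch of how a macro-F1 point estimate with a percentile bootstrap 95% CI, like those reported above, can be computed. The paper does not specify its exact resampling procedure, so the bootstrap setup, the macro_f1_ci helper, and the toy labels are illustrative assumptions.

```python
# Macro-F1 with a nonparametric percentile bootstrap 95% CI over
# test-set predictions (an assumed procedure, not the paper's own).
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_ci(y_true, y_pred, n_boot=1000, seed=0):
    """Return the macro-F1 point estimate and a percentile bootstrap 95% CI."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = f1_score(y_true, y_pred, average="macro", zero_division=0)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        # Resample prediction/label pairs with replacement.
        idx = rng.integers(0, len(y_true), len(y_true))
        scores.append(f1_score(y_true[idx], y_pred[idx],
                               average="macro", zero_division=0))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return point, (lo, hi)

# Toy usage with fabricated labels (not the study's data):
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(macro_f1_ci(y_true, y_pred))
```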