Large Language Model GPT-4 Compared to Endocrinologist Responses on Initial Choice of Glucose-Lowering Medication Under Conditions of Clinical Uncertainty.

Authors

Flory James H, Ancker Jessica S, Kim Scott Y H, Kuperman Gilad, Petrov Aleksandr, Vickers Andrew

Affiliations

Memorial Sloan Kettering Cancer Center, New York, NY.

Vanderbilt University Medical Center, Nashville, TN.

Publication information

Diabetes Care. 2025 Feb 1;48(2):185-192. doi: 10.2337/dc24-1067.

Abstract

OBJECTIVE

To explore how the commercially available large language model (LLM) GPT-4 compares with endocrinologists in answering medical questions for which the best answer is uncertain.

RESEARCH DESIGN AND METHODS

This study compared responses from GPT-4 to responses from 31 endocrinologists using hypothetical clinical vignettes focused on diabetes, specifically examining the prescription of metformin versus alternative treatments. The primary outcome was the choice between metformin and other treatments.

RESULTS

With a simple prompt, GPT-4 chose metformin in 12% (95% CI 7.9-17%) of responses, compared with 31% (95% CI 23-39%) of endocrinologist responses. After modifying the prompt to encourage metformin use, the selection of metformin by GPT-4 increased to 25% (95% CI 22-28%). GPT-4 rarely selected metformin in patients with impaired kidney function or a history of gastrointestinal distress (2.9% of responses, 95% CI 1.4-5.5%). In contrast, endocrinologists often prescribed metformin even in patients with a history of gastrointestinal distress (21% of responses, 95% CI 12-36%). GPT-4 responses showed low variability on repeated runs except at intermediate levels of kidney function.

CONCLUSIONS

In clinical scenarios with no single right answer, GPT-4's responses were reasonable, but differed from endocrinologists' responses in clinically important ways. Value judgments are needed to determine when these differences should be addressed by adjusting the model. We recommend against reliance on LLM output until it is shown to align not just with clinical guidelines but also with patient and clinician preferences, or it demonstrates improvement in clinical outcomes over standard of care.
