

Large Language Models' Clinical Decision-Making on When to Perform a Kidney Biopsy: Comparative Study.

Authors

Toal Michael, Hill Christopher, Quinn Michael, O'Neill Ciaran, Maxwell Alexander P

Affiliations

Centre for Public Health, Royal Victoria Hospital, Queen's University Belfast, Grosvenor Road, Belfast, BT12 6BA, United Kingdom, 44 28 9097 6350.

Regional Centre for Nephrology and Transplantation, Belfast City Hospital, Belfast, United Kingdom.

Publication Information

J Med Internet Res. 2025 Sep 18;27:e73603. doi: 10.2196/73603.

Abstract

BACKGROUND

Artificial intelligence (AI) and large language models (LLMs) are increasing in sophistication and are being integrated into many disciplines. The potential for LLMs to augment clinical decision-making is an evolving area of research.

OBJECTIVE

This study compared the responses of over 1000 kidney specialist physicians (nephrologists) with the outputs of commonly used LLMs, using a questionnaire on when a kidney biopsy should be performed.

METHODS

This research group conducted a large online questionnaire of nephrologists to determine when a kidney biopsy should be performed. The questionnaire was co-designed with patient input, refined through multiple iterations, and piloted locally before international dissemination. It was the largest international study in the field and demonstrated variation among human clinicians in biopsy propensity relating to human factors such as sex and age, as well as systemic factors such as country, job seniority, and technical proficiency. The same questions were put to both human doctors and LLMs in an identical order in a single session. Eight commonly used LLMs were interrogated: ChatGPT-3.5, Mistral Hugging Face, Perplexity, Microsoft Copilot, Llama 2, GPT-4, MedLM, and Claude 3. The most common response given by clinicians (the human mode) for each question was taken as the baseline for comparison. Questionnaire responses on the indications and contraindications for biopsy generated a score (0-44) reflecting biopsy propensity, with a higher score used as a surrogate marker for increased tolerance of the potential associated risks.
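The comparison logic described above can be sketched in a few lines: take the modal clinician answer per question as the baseline, count LLM agreement, and sum pro-biopsy answers into a propensity score. The questions, responses, and scoring weights below are illustrative placeholders, not the study's actual instrument or data.

```python
from collections import Counter

def human_mode(responses):
    """Most common clinician answer for one question (ties broken arbitrarily)."""
    return Counter(responses).most_common(1)[0][0]

# Illustrative data: 3 questions answered by 5 hypothetical clinicians.
clinician_answers = {
    "q1": ["biopsy", "biopsy", "no_biopsy", "biopsy", "biopsy"],
    "q2": ["no_biopsy", "no_biopsy", "biopsy", "no_biopsy", "no_biopsy"],
    "q3": ["biopsy", "no_biopsy", "biopsy", "biopsy", "no_biopsy"],
}

# One hypothetical LLM's answers to the same questions.
llm_answers = {"q1": "biopsy", "q2": "no_biopsy", "q3": "no_biopsy"}

# Baseline: the modal clinician response per question.
baseline = {q: human_mode(r) for q, r in clinician_answers.items()}

# Agreement: number of questions where the LLM matches the human mode.
agreement = sum(llm_answers[q] == baseline[q] for q in baseline)

# Propensity: each pro-biopsy answer adds to the score (a simplified
# stand-in for the study's 0-44 weighting of indications/contraindications).
propensity = sum(a == "biopsy" for a in llm_answers.values())
```

The same baseline and propensity calculation applied to the clinician mode itself yields the human reference score against which each model is compared.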

RESULTS

The ability of LLMs to reproduce human expert consensus varied widely: some models demonstrated a balanced approach to risk similar to that of the human clinicians, while others produced outputs at either extreme of risk tolerance. In terms of agreement with the human mode, ChatGPT-3.5 and GPT-4 (OpenAI) had the highest levels of alignment, agreeing with the human mode on 6 of 11 questions. The total biopsy propensity score generated from the human mode was 23 out of 44, and both OpenAI models produced similar propensity scores, between 22 and 24. Llama 2 and Microsoft Copilot also scored within this range but aligned with the human consensus on only 2 of 11 questions. The most risk-averse model in this study was MedLM, with a propensity score of 11; the least risk-averse was Claude 3, with a score of 34.

CONCLUSIONS

The outputs of LLMs demonstrated a modest ability to replicate human clinical decision-making in this study; however, performance varied widely between models. Questions with more uniform human responses produced LLM outputs with higher alignment, whereas questions with lower human consensus produced poorer alignment. This variability may limit the practical use of LLMs in real-world clinical practice.


