Large Language Models' Clinical Decision-Making on When to Perform a Kidney Biopsy: Comparative Study.

Authors

Toal Michael, Hill Christopher, Quinn Michael, O'Neill Ciaran, Maxwell Alexander P

Affiliations

Centre for Public Health, Royal Victoria Hospital, Queen's University Belfast, Grosvenor Road, Belfast, BT12 6BA, United Kingdom. Tel: +44 28 9097 6350.

Regional Centre for Nephrology and Transplantation, Belfast City Hospital, Belfast, United Kingdom.

Publication Information

J Med Internet Res. 2025 Sep 18;27:e73603. doi: 10.2196/73603.

DOI: 10.2196/73603
PMID: 40966592
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12445783/
Abstract

BACKGROUND

Artificial intelligence (AI) and large language models (LLMs) are increasing in sophistication and are being integrated into many disciplines. The potential for LLMs to augment clinical decision-making is an evolving area of research.

OBJECTIVE

This study compared the responses of over 1000 kidney specialist physicians (nephrologists) with the outputs of commonly used LLMs on a questionnaire about when a kidney biopsy should be performed.

METHODS

The research group conducted a large online questionnaire for nephrologists to determine when a kidney biopsy should be performed. The questionnaire was co-designed with patient input, refined through multiple iterations, and piloted locally before international dissemination. It was the largest international study in the field and demonstrated variation among human clinicians in biopsy propensity related to human factors such as sex and age, as well as systemic factors such as country, job seniority, and technical proficiency. The same questions were put to human doctors and LLMs in an identical order in a single session. Eight commonly used LLMs were interrogated: ChatGPT-3.5, Mistral (Hugging Face), Perplexity, Microsoft Copilot, Llama 2, GPT-4, MedLM, and Claude 3. The most common response given by clinicians (the human mode) for each question was taken as the baseline for comparison. Questionnaire responses on the indications and contraindications for biopsy generated a score (0-44) reflecting biopsy propensity, in which a higher score served as a surrogate marker for an increased tolerance of the potential associated risks.
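The comparison scheme described above (modal human answer as baseline, per-question agreement, and a summed propensity score) can be sketched in a few lines of Python. The questions, answer coding, and per-question weights below are hypothetical placeholders for illustration, not the study's actual instrument.

```python
from collections import Counter

def human_mode(responses_per_question):
    # Most common human answer to each question: the study's baseline.
    return [Counter(answers).most_common(1)[0][0] for answers in responses_per_question]

def agreement(model_answers, baseline):
    # Number of questions on which a model matches the human mode.
    return sum(m == b for m, b in zip(model_answers, baseline))

def propensity_score(answers, weights):
    # Sum of per-answer weights; a higher total stands in for greater
    # tolerance of the risks associated with performing a biopsy.
    return sum(weights[i][a] for i, a in enumerate(answers))

# Hypothetical data: three questions, each answered "yes" (biopsy) or "no".
human_responses = [
    ["yes", "yes", "no"],   # Q1: most clinicians favor biopsy
    ["no", "no", "yes"],    # Q2: most say no
    ["yes", "no", "no"],    # Q3: most say no
]
weights = [{"yes": 4, "no": 0}] * 3  # "yes" adds 4 points toward risk tolerance

baseline = human_mode(human_responses)  # ["yes", "no", "no"]
llm = ["yes", "yes", "no"]              # one model's answers
print(agreement(llm, baseline))         # 2 of 3 questions match the human mode
print(propensity_score(llm, weights))   # 8
```

In the study itself there were 11 questions and the score ranged from 0 to 44; the same mode-and-sum logic scales directly to that instrument.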

RESULTS

The ability of LLMs to reproduce human expert consensus varied widely: some models demonstrated a balanced approach to risk similar to that of human clinicians, while others produced outputs at either extreme of risk tolerance. In terms of agreement with the human mode, ChatGPT-3.5 and GPT-4 (OpenAI) had the highest levels of alignment, agreeing with the human mode on 6 of 11 questions. The total biopsy propensity score generated from the human mode was 23 out of 44. Both OpenAI models produced similar propensity scores, between 22 and 24. Llama 2 and Microsoft Copilot also scored within this range, but with poorer alignment to the human consensus, agreeing on only 2 of 11 questions. The most risk-averse model in this study was MedLM, with a propensity score of 11, and the least risk-averse was Claude 3, with a score of 34.

CONCLUSIONS

In this study, the outputs of LLMs demonstrated a modest ability to replicate human clinical decision-making; however, performance varied widely between models. Questions with more uniform human responses produced LLM outputs with higher alignment, whereas questions with lower human consensus showed poorer alignment. This variability may limit the practical use of LLMs in real-world clinical practice.
