
On the limitations of large language models in clinical diagnosis.

Author information

Reese Justin T, Danis Daniel, Caufield J Harry, Groza Tudor, Casiraghi Elena, Valentini Giorgio, Mungall Christopher J, Robinson Peter N

Affiliations

Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.

The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA.

Publication information

medRxiv. 2024 Feb 26:2023.07.13.23292613. doi: 10.1101/2023.07.13.23292613.

DOI: 10.1101/2023.07.13.23292613
PMID: 37503093
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10370243/
Abstract

OBJECTIVE

Large language models such as GPT-4 have previously been applied to differential-diagnosis challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available in typical electronic health records (EHRs). Furthermore, even if such a narrative were available in an EHR, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information.

MATERIALS AND METHODS

We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically.
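
The prompt-construction step described above can be illustrated with a minimal sketch. The function name, section wording, and example terms below are illustrative assumptions, not the authors' actual pipeline; the point is that the prompt is assembled only from structured terms, so no free-text narrative (and hence no protected health information) is included.

```python
# Hypothetical sketch: building a PHI-free diagnostic prompt from
# structured terms (phenotypes, comorbidities, treatments, lab tests).
# Wording and term lists are illustrative, not the paper's pipeline.

def build_prompt(phenotypes, comorbidities, treatments, lab_tests):
    """Assemble a prompt from structured term lists only (no free text)."""
    sections = []
    if phenotypes:
        sections.append("Phenotypic abnormalities: " + ", ".join(phenotypes))
    if comorbidities:
        sections.append("Comorbidities: " + ", ".join(comorbidities))
    if treatments:
        sections.append("Treatments: " + ", ".join(treatments))
    if lab_tests:
        sections.append("Laboratory findings: " + ", ".join(lab_tests))
    return ("Provide a ranked differential diagnosis for a patient with "
            "the following findings.\n" + "\n".join(sections))

prompt = build_prompt(
    phenotypes=["Seizure", "Global developmental delay"],
    comorbidities=["Asthma"],
    treatments=["Levetiracetam"],
    lab_tests=["Elevated serum lactate"],
)
print(prompt)
```

Because only controlled-vocabulary terms reach the prompt, the narrative detail of the original case report is deliberately discarded, which is the trade-off the study measures.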

RESULTS

Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task.
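
The evaluation metric here (the fraction of cases in which the correct diagnosis is ranked first) can be sketched as follows; the function and data are assumptions for illustration, not the study's evaluation code.

```python
# Minimal sketch (assumed, not from the paper): rank-1 accuracy over a
# set of cases, where each case pairs a model's ranked differential
# with the known correct diagnosis.

def top1_accuracy(cases):
    """cases: list of (ranked_predictions, correct_diagnosis) pairs."""
    hits = sum(1 for preds, truth in cases if preds and preds[0] == truth)
    return hits / len(cases)

cases = [
    (["Marfan syndrome", "Homocystinuria"], "Marfan syndrome"),  # hit
    (["Lupus", "Sarcoidosis"], "Sarcoidosis"),                   # miss
]
print(top1_accuracy(cases))  # → 0.5
```

By this measure, the study's 5.3-17.6% figures mean the correct diagnosis topped the list in fewer than one in five cases for every prompt variant tested.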

DISCUSSION

The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings.

CONCLUSION

Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.


Figures (PMC10901414):

- Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/18f1/10901414/53181baebb5b/nihpp-2023.07.13.23292613v2-f0001.jpg
- Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/18f1/10901414/75d63af02dec/nihpp-2023.07.13.23292613v2-f0002.jpg
- Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/18f1/10901414/a5081ab3be6c/nihpp-2023.07.13.23292613v2-f0003.jpg
- Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/18f1/10901414/b2add01e9122/nihpp-2023.07.13.23292613v2-f0004.jpg

Similar articles

1. On the limitations of large language models in clinical diagnosis.
medRxiv. 2024 Feb 26:2023.07.13.23292613. doi: 10.1101/2023.07.13.23292613.
2. Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
3. Extraction of Substance Use Information From Clinical Notes: Generative Pretrained Transformer-Based Investigation.
JMIR Med Inform. 2024 Aug 19;12:e56243. doi: 10.2196/56243.
4. Operationalizing and Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3 (GPT-3) as a Service Model.
JMIR Med Inform. 2022 Feb 10;10(2):e32875. doi: 10.2196/32875.
5. An evaluation of GPT models for phenotype concept recognition.
BMC Med Inform Decis Mak. 2024 Jan 31;24(1):30. doi: 10.1186/s12911-024-02439-w.
6. A Study of Biomedical Relation Extraction Using GPT Models.
AMIA Jt Summits Transl Sci Proc. 2024 May 31;2024:391-400. eCollection 2024.
7. Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools.
medRxiv. 2024 Nov 7:2024.07.22.24310816. doi: 10.1101/2024.07.22.24310816.
8. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
9. Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer.
Radiat Oncol J. 2023 Sep;41(3):209-216. doi: 10.3857/roj.2023.00633. Epub 2023 Sep 21.
10. Performance of Generative Pretrained Transformer on the National Medical Licensing Examination in Japan.
PLOS Digit Health. 2024 Jan 23;3(1):e0000433. doi: 10.1371/journal.pdig.0000433. eCollection 2024 Jan.
