Evaluating ChatGPT-4's Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases.

Authors

Hirosawa Takanobu, Harada Yukinori, Mizuta Kazuya, Sakamoto Tetsu, Tokumasu Kazuki, Shimizu Taro

Affiliations

Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan.

Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan.

Publication

JMIR Form Res. 2024 Jun 26;8:e59267. doi: 10.2196/59267.

DOI: 10.2196/59267
PMID: 38924784
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11237772/
Abstract

BACKGROUND

The potential of artificial intelligence (AI) chatbots, particularly ChatGPT with GPT-4 (OpenAI), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential diagnosis lists.

OBJECTIVE

This study aims to assess the capability of GPT-4 in identifying the final diagnosis from differential-diagnosis lists and to compare its performance with that of physicians for case report series.

METHODS

We used a database of differential-diagnosis lists from case reports in the American Journal of Case Reports, corresponding to final diagnoses. These lists were generated by 3 AI systems: GPT-4, Google Bard (currently Google Gemini), and Large Language Models by Meta AI 2 (LLaMA2). The primary outcome was focused on whether GPT-4's evaluations identified the final diagnosis within these lists. None of these AIs received additional medical training or reinforcement. For comparison, 2 independent physicians also evaluated the lists, with any inconsistencies resolved by another physician.

RESULTS

The 3 AIs generated a total of 1176 differential diagnosis lists from 392 case descriptions. GPT-4's evaluations concurred with those of the physicians in 966 out of 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating a fair to good agreement between GPT-4 and the physicians' evaluations.
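
As a side note on the statistics in this paragraph, the sketch below shows how observed agreement and the Cohen κ coefficient are derived from a 2x2 table of GPT-4-versus-physician judgments. The abstract reports only the totals (966 agreements out of 1176 lists, κ = 0.63), so the per-cell counts in the example are hypothetical values chosen to be consistent with those totals, not the study's actual data.

```python
# Minimal sketch: observed agreement and Cohen's kappa for two binary raters
# (GPT-4 vs. physicians judging whether the final diagnosis appears in a list).
# The cell counts passed in below are HYPOTHETICAL; the abstract reports only
# the totals (966/1176 agreements, kappa = 0.63), not the full 2x2 table.

def cohens_kappa(both_yes, gpt_yes_phys_no, gpt_no_phys_yes, both_no):
    n = both_yes + gpt_yes_phys_no + gpt_no_phys_yes + both_no
    p_o = (both_yes + both_no) / n                      # observed agreement
    gpt_yes_rate = (both_yes + gpt_yes_phys_no) / n     # marginal "yes" rates
    phys_yes_rate = (both_yes + gpt_no_phys_yes) / n
    # chance agreement expected from the marginals
    p_e = gpt_yes_rate * phys_yes_rate + (1 - gpt_yes_rate) * (1 - phys_yes_rate)
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa

# Illustrative cells summing to 1176 lists with 966 agreements.
p_o, kappa = cohens_kappa(both_yes=591, gpt_yes_phys_no=105,
                          gpt_no_phys_yes=105, both_no=375)
print(f"observed agreement = {p_o:.1%}, Cohen's kappa = {kappa:.2f}")
# -> observed agreement = 82.1%, Cohen's kappa = 0.63
```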

CONCLUSIONS

GPT-4 demonstrated a fair to good agreement in identifying the final diagnosis from differential-diagnosis lists, comparable to physicians for case report series. Its ability to compare differential diagnosis lists with final diagnoses suggests its potential to aid clinical decision-making support through diagnostic feedback. While GPT-4 showed a fair to good agreement for evaluation, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.

Figures (PMC full text):
Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39c6/11237772/d5271d21bb18/formative_v8i1e59267_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39c6/11237772/6967eb833a48/formative_v8i1e59267_fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39c6/11237772/cf934477e3d6/formative_v8i1e59267_fig3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39c6/11237772/56a1ab01953b/formative_v8i1e59267_fig4.jpg

Similar Articles

1
Evaluating ChatGPT-4's Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases.
JMIR Form Res. 2024 Jun 26;8:e59267. doi: 10.2196/59267.
2
Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration.
JMIR Med Inform. 2024 Apr 9;12:e55627. doi: 10.2196/55627.
3
Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician?
Diagnosis (Berl). 2024 Mar 12;11(3):321-324. doi: 10.1515/dx-2024-0027. eCollection 2024 Aug 1.
4
Diagnostic performance of generative artificial intelligences for a series of complex case reports.
Digit Health. 2024 Jul 21;10:20552076241265215. doi: 10.1177/20552076241265215. eCollection 2024 Jan-Dec.
5
ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation.
JMIR Med Inform. 2023 Oct 9;11:e48808. doi: 10.2196/48808.
6
Comparative Study to Evaluate the Accuracy of Differential Diagnosis Lists Generated by Gemini Advanced, Gemini, and Bard for a Case Report Series Analysis: Cross-Sectional Study.
JMIR Med Inform. 2024 Oct 2;12:e63010. doi: 10.2196/63010.
7
Comparative analysis of GPT-4-based ChatGPT's diagnostic performance with radiologists using real-world radiology reports of brain tumors.
Eur Radiol. 2025 Apr;35(4):1938-1947. doi: 10.1007/s00330-024-11032-8. Epub 2024 Aug 28.
8
Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study.
Int J Environ Res Public Health. 2023 Feb 15;20(4):3378. doi: 10.3390/ijerph20043378.
9
Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study.
J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324.
10
Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.

Cited By

1
Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method.
JMIRx Med. 2025 Aug 29;6:e67661. doi: 10.2196/67661.
2
Can AI match emergency physicians in managing common emergency cases? A comparative performance evaluation.
BMC Emerg Med. 2025 Jul 31;25(1):142. doi: 10.1186/s12873-025-01303-y.
3
Computerized diagnostic decision support systems-Isabel Pro versus ChatGPT-4 part II.
JAMIA Open. 2025 Jun 16;8(3):ooaf048. doi: 10.1093/jamiaopen/ooaf048. eCollection 2025 Jun.
4
Comparative evaluation of artificial intelligence models GPT-4 and GPT-3.5 in clinical decision-making in sports surgery and physiotherapy: a cross-sectional study.
BMC Med Inform Decis Mak. 2025 Apr 14;25(1):163. doi: 10.1186/s12911-025-02996-8.

References

1
Comparative Analysis of Multimodal Large Language Model Performance on Clinical Vignette Questions.
JAMA. 2024 Apr 16;331(15):1320-1321. doi: 10.1001/jama.2023.27861.
2
Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician?
Diagnosis (Berl). 2024 Mar 12;11(3):321-324. doi: 10.1515/dx-2024-0027. eCollection 2024 Aug 1.
3
ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation.
JMIR Med Inform. 2023 Oct 9;11:e48808. doi: 10.2196/48808.
4
Revolutionizing healthcare: the role of artificial intelligence in clinical practice.
BMC Med Educ. 2023 Sep 22;23(1):689. doi: 10.1186/s12909-023-04698-z.
5
Comparative Evaluation of Diagnostic Accuracy Between Google Bard and Physicians.
Am J Med. 2023 Nov;136(11):1119-1123.e18. doi: 10.1016/j.amjmed.2023.08.003. Epub 2023 Aug 27.
6
Burden of serious harms from diagnostic error in the USA.
BMJ Qual Saf. 2024 Jan 19;33(2):109-120. doi: 10.1136/bmjqs-2021-014130.
7
Utility of ChatGPT in Clinical Practice.
J Med Internet Res. 2023 Jun 28;25:e48568. doi: 10.2196/48568.
8
Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge.
JAMA. 2023 Jul 3;330(1):78-80. doi: 10.1001/jama.2023.8288.
9
Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions.
JMIR Med Educ. 2023 Jun 1;9:e48291. doi: 10.2196/48291.
10
Artificial Intelligence and Machine Learning in Clinical Medicine, 2023.
N Engl J Med. 2023 Mar 30;388(13):1201-1208. doi: 10.1056/NEJMra2302038.