

The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis.

Affiliations

Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, Beijing, China.

Department of Respiratory and Critical Care Medicine, The First Medical Centre, Chinese PLA General Hospital, Beijing, China.

Publication information

J Med Internet Res. 2024 Sep 10;26:e54985. doi: 10.2196/54985.

DOI: 10.2196/54985
PMID: 39255016
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11422746/
Abstract

BACKGROUND

ChatGPT (OpenAI) has shown great potential in clinical diagnosis and could become an excellent auxiliary tool in clinical practice. This study investigates and evaluates the diagnostic capabilities of ChatGPT by comparing the performance of GPT-3.5 and GPT-4.0 across model iterations.

OBJECTIVE

This study aims to evaluate the precise diagnostic ability of GPT-3.5 and GPT-4.0 for colon cancer, assess their potential as an auxiliary diagnostic tool for surgeons, and compare the diagnostic accuracy rates of GPT-3.5 and GPT-4.0. We assess the accuracy of primary and secondary diagnoses and analyze the causes of misdiagnoses in GPT-3.5 and GPT-4.0 according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings.

METHODS

We retrieved 316 case reports of intestinal cancer from the Chinese Medical Association Publishing House database, of which 286 cases were deemed valid after data cleansing. The cases were translated from Mandarin to English and then input into GPT-3.5 and GPT-4.0 using a simple, direct prompt to elicit primary and secondary diagnoses. We conducted a comparative study to evaluate the diagnostic accuracy of GPT-4.0 and GPT-3.5. Three senior surgeons specializing in colorectal surgery from the General Surgery Department of the Chinese PLA (People's Liberation Army) General Hospital assessed the diagnostic information. The accuracy of primary and secondary diagnoses was scored based on predefined criteria. Additionally, we analyzed and compared the causes of misdiagnoses in both models according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings.

RESULTS

Out of 286 cases, GPT-4.0 and GPT-3.5 both demonstrated high diagnostic accuracy for primary diagnoses, but the accuracy rates of GPT-4.0 were significantly higher than those of GPT-3.5 (mean 0.972, SD 0.137 vs mean 0.855, SD 0.335; t=5.753; P<.001). For secondary diagnoses, the accuracy rates of GPT-4.0 were also significantly higher than those of GPT-3.5 (mean 0.908, SD 0.159 vs mean 0.617, SD 0.349; t=-7.727; P<.001). GPT-3.5 showed limitations in processing patient history, symptom presentation, laboratory tests, and imaging data. While GPT-4.0 improved upon GPT-3.5, it still showed limitations in identifying symptoms and laboratory test data. For both primary and secondary diagnoses, there was no significant difference in accuracy related to age, gender, or system group between GPT-4.0 and GPT-3.5.
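The model comparison above rests on two-sample t tests over per-case accuracy scores. A minimal sketch of that kind of comparison, using Welch's t statistic (which does not assume equal variances); the function and all scores below are illustrative stand-ins, not the study's data or its exact statistical procedure:

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's two-sample t statistic: difference of means divided by
    the combined standard error, without assuming equal variances."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    se = math.sqrt(va / na + vb / nb)
    return (mean(a) - mean(b)) / se

# Hypothetical per-case accuracy scores in [0, 1] -- NOT the study's data
gpt4_scores = [1.0, 1.0, 0.9, 1.0, 0.8, 1.0, 1.0, 0.9]
gpt35_scores = [1.0, 0.7, 0.8, 0.6, 1.0, 0.9, 0.5, 0.8]

print(round(welch_t(gpt4_scores, gpt35_scores), 3))  # positive: GPT-4 mean is higher
```

A positive t statistic indicates the first group's mean accuracy is higher; the study then reports the corresponding P value against its significance threshold.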

CONCLUSIONS

This study demonstrates that ChatGPT, particularly GPT-4.0, possesses significant diagnostic potential, with GPT-4.0 exhibiting higher accuracy than GPT-3.5. However, GPT-4.0 still has limitations, particularly in recognizing patient symptoms and laboratory data, indicating a need for more research in real-world clinical settings to enhance its diagnostic capabilities.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/302d/11422746/bd9c1ddc7768/jmir_v26i1e54985_fig1.jpg


