


ChatGPT4's diagnostic accuracy in inpatient neurology: A retrospective cohort study.

Authors

Cano-Besquet Sebastian, Rice-Canetto Tyler, Abou-El-Hassan Hadi, Alarcon Simon, Zimmerman Jason, Issagholian Leo, Salomon Nasser, Rojas Ivan, Dhahbi Joseph, Neeki Michael M

Affiliations

California University of Science and Medicine, 1501 Violet St, Colton, Ca, 92324, USA.

Department of Emergency Medicine, Arrowhead Regional Medical Center, 400 N. Pepper Ave, Colton, CA, 92324, USA.

Publication

Heliyon. 2024 Dec 9;10(24):e40964. doi: 10.1016/j.heliyon.2024.e40964. eCollection 2024 Dec 30.

DOI:10.1016/j.heliyon.2024.e40964
PMID:39759322
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11699242/
Abstract

BACKGROUND

Large language models (LLMs) such as ChatGPT-4 (CG4) are proving to be valuable tools in the medical field, not only by facilitating administrative tasks but also by augmenting medical decision-making. LLMs have previously been tested for diagnostic accuracy using expert-generated questions and standardized test data. Among those studies, CG4 consistently outperformed alternative LLMs, including ChatGPT-3.5 (no longer publicly available) and Google Bard (now known as "Google Gemini"). The next logical step was to explore CG4's accuracy within a specific clinical domain. Our study evaluated the diagnostic accuracy of CG4 within an inpatient neurology consultation service.

METHODS

We conducted a review of all patients listed on the daily neurology consultation roster at Arrowhead Regional Medical Center in Colton, CA, for all days surveyed until we reached a total of 51 patients, ensuring a complete and representative sample of the patient population. Using a HIPAA-compliant methodology, ChatGPT-4 received patient data from the Epic EHR as input and was asked to provide an initial differential diagnosis list, investigations and recommended actions, a final diagnosis, and a treatment plan for each patient. A comprehensiveness scale (an ordinal scale from 0 to 3) was then used to rate how closely the consultant and CG4 initial diagnoses matched the consultants' final diagnoses. In this proof-of-concept study, we assumed that the neurology consultants' final diagnoses were accurate. We employed non-parametric bootstrap resampling to construct 95% confidence intervals around mean scores, and used Fisher's exact test, the Wilcoxon rank-sum test, and ordinal logistic regression models to compare performance between the consultant and CG4 groups.
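
The non-parametric bootstrap procedure described above can be illustrated with a minimal sketch. The scores and resample count below are placeholders, not the study's data; the idea is simply to resample the observed ordinal scores with replacement and read the 2.5th and 97.5th percentiles of the resampled means:

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Non-parametric bootstrap CI for the mean of ordinal scores.

    Resamples the scores with replacement n_boot times, computes the
    mean of each resample, and returns the point estimate together
    with the (alpha/2, 1 - alpha/2) percentile interval.
    """
    means = []
    for _ in range(n_boot):
        resample = random.choices(scores, k=len(scores))  # with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    point = sum(scores) / len(scores)
    return point, (lo, hi)
```

The percentile method used here is only one of several bootstrap interval constructions; the paper does not specify which variant was applied.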

FINDINGS

Our study found that CG4 demonstrated diagnostic accuracy comparable to that of consultant neurologists. The most frequent comprehensiveness score achieved by both groups was "3," with consultant neurologists achieving this score 43 times and CG4 achieving it 31 times. The mean comprehensiveness scores were 2.75 (95% CI: 2.49-2.90) for the consultant group and 2.57 (95% CI: 2.31-2.67) for the CG4 group. The success rate for comprehensive diagnoses (a score of "2" or "3") was 94.1% (95% CI: 84.1%-98.0%) for consultants and 96.1% (95% CI: 86.8%-98.9%) for CG4, with no statistically significant difference in success rates (p = 1.00). The Wilcoxon rank-sum test indicated that the consultant group had a higher likelihood of providing more comprehensive diagnoses (W = 1583, p = 0.02). Ordinal logistic regression models identified significant predictors of diagnostic accuracy, with the consultant diagnosis group showing an odds ratio of 3.68 (95% CI: 1.28-10.55) for higher-value outcomes. Notably, integrating CG4's initial diagnoses with those from consultants could achieve comprehensive diagnostics in all cases, indicating a number needed to treat (NNT) of 17 to attain one additional comprehensive diagnosis.
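
The NNT of 17 follows from simple arithmetic. The counts below are inferred from the reported percentages (94.1% of 51 patients corresponds to 48 comprehensive consultant diagnoses), so this is an illustrative reconstruction rather than the study's own calculation:

```python
# Counts inferred from the reported rates (48/51 = 94.1%); illustrative only.
n = 51                          # patients reviewed
consultant_comprehensive = 48   # consultant diagnoses scoring 2 or 3
combined_comprehensive = 51     # consultant + CG4 union covered every case

arr = (combined_comprehensive - consultant_comprehensive) / n  # absolute risk reduction
nnt = round(1 / arr)            # number needed to treat
print(nnt)  # 17
```

That is, the combined approach added 3 comprehensive diagnoses over 51 patients, so on average one additional comprehensive diagnosis is gained for every 51/3 = 17 patients.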

INTERPRETATION

Our findings suggest that CG4 can serve as a valuable diagnostic tool within the domain of inpatient neurology, providing comprehensive and accurate initial diagnoses comparable to those of consultant neurologists. The use of CG4 might contribute to better patient outcomes by serving as an aid in diagnosis and treatment recommendations, potentially leading to reduced missed diagnoses and quicker diagnostic processes. Continuous strategies and evaluations to improve LLMs' accuracy remain crucial. Further studies with larger sample sizes and independent third-party evaluations are recommended to confirm these findings and assess the impact of LLMs on patient health.


Figures (PMC11699242):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00b2/11699242/f4d8cb994dfa/ga1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00b2/11699242/4ae31641be85/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00b2/11699242/7e3705bfa417/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00b2/11699242/a34887e0934b/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00b2/11699242/557c730f6248/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00b2/11699242/f112873f27ad/gr5.jpg

Similar Articles

1. ChatGPT4's diagnostic accuracy in inpatient neurology: A retrospective cohort study.
Heliyon. 2024 Dec 9;10(24):e40964. doi: 10.1016/j.heliyon.2024.e40964. eCollection 2024 Dec 30.
2. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard.
EBioMedicine. 2023 Sep;95:104770. doi: 10.1016/j.ebiom.2023.104770. Epub 2023 Aug 23.
3. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.
JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
4. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.
J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.
5. Assessment of Large Language Models in Cataract Care Information Provision: A Quantitative Comparison.
Ophthalmol Ther. 2025 Jan;14(1):103-116. doi: 10.1007/s40123-024-01066-y. Epub 2024 Nov 8.
6. Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
7. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.
Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.
8. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration.
JMIR Med Inform. 2024 Apr 9;12:e55627. doi: 10.2196/55627.
9. Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis.
JMIR Med Inform. 2024 Aug 6;12:e59273. doi: 10.2196/59273.
10. Urology consultants versus large language models: Potentials and hazards for medical advice in urology.
BJUI Compass. 2024 Apr 3;5(5):438-444. doi: 10.1002/bco2.359. eCollection 2024 May.
