

Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study.

Affiliations

Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States.

Department of Health Services, Policy and Practice, Brown University School of Public Health, Providence, RI, United States.

Publication

JMIR Mhealth Uhealth. 2023 Oct 3;11:e49995. doi: 10.2196/49995.

DOI:10.2196/49995
PMID:37788063
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10582809/
Abstract

BACKGROUND

Diagnosis is a core component of effective health care, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can play a role in improving diagnosis by physicians and other health care workers. Symptom checkers (SCs) have been designed to improve diagnosis and triage (ie, which level of care to seek) by patients.

OBJECTIVE

The aim of this study was to evaluate the performance of the new large language model ChatGPT (versions 3.5 and 4.0), the widely used WebMD SC, and an SC developed by Ada Health in the diagnosis and triage of patients with urgent or emergent clinical problems compared with the final emergency department (ED) diagnoses and physician reviews.

METHODS

We used previously collected, deidentified, self-report data from 40 patients presenting to an ED for care who used the Ada SC to record their symptoms prior to seeing the ED physician. Deidentified data were entered into ChatGPT versions 3.5 and 4.0 and WebMD by a research assistant blinded to diagnoses and triage. Diagnoses from all 4 systems were compared with the previously abstracted final diagnoses in the ED as well as with diagnoses and triage recommendations from three independent board-certified ED physicians who had blindly reviewed the self-report clinical data from Ada. Diagnostic accuracy was calculated as the proportion of the diagnoses from ChatGPT, Ada SC, WebMD SC, and the independent physicians that matched at least one ED diagnosis (stratified as top 1 or top 3). Triage accuracy was calculated as the number of recommendations from ChatGPT, WebMD, or Ada that agreed with at least 2 of the independent physicians or were rated "unsafe" or "too cautious."
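The accuracy definitions above are computable, and can be sketched as a small function. The per-case data below are hypothetical (the study's case records are not reproduced in this abstract); the sketch only illustrates the top-k matching rule: a case counts as a match if any of a system's top-k ranked diagnoses appears among the final ED diagnoses.

```python
def topk_match_rate(predictions, ed_diagnoses, k):
    """Proportion of cases where any of the top-k predicted diagnoses
    matches at least one final emergency department (ED) diagnosis."""
    matches = 0
    for preds, truth in zip(predictions, ed_diagnoses):
        if any(p in truth for p in preds[:k]):
            matches += 1
    return matches / len(predictions)

# Hypothetical example: 3 cases, each with a ranked list of predicted
# diagnoses and a set of final ED diagnoses.
preds = [
    ["appendicitis", "gastroenteritis", "UTI"],
    ["migraine", "tension headache", "sinusitis"],
    ["angina", "GERD", "costochondritis"],
]
truth = [
    {"appendicitis"},
    {"subarachnoid hemorrhage"},
    {"GERD"},
]

print(topk_match_rate(preds, truth, 1))  # case 1 matches on top-1 -> 1/3
print(topk_match_rate(preds, truth, 3))  # cases 1 and 3 match within top-3 -> 2/3
```

Triage accuracy follows the same counting pattern, except that the reference standard is agreement with at least 2 of the 3 independent physicians rather than the ED diagnosis list.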

RESULTS

Overall, 30 and 37 cases had sufficient data for diagnostic and triage analysis, respectively. The rate of top-1 diagnosis matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 9 (30%), 12 (40%), 10 (33%), and 12 (40%), respectively, with a mean rate of 47% for the physicians. The rate of top-3 diagnostic matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 19 (63%), 19 (63%), 15 (50%), and 17 (57%), respectively, with a mean rate of 69% for physicians. The distribution of triage results for Ada was 62% (n=23) agree, 14% unsafe (n=5), and 24% (n=9) too cautious; that for ChatGPT 3.5 was 59% (n=22) agree, 41% (n=15) unsafe, and 0% (n=0) too cautious; that for ChatGPT 4.0 was 76% (n=28) agree, 22% (n=8) unsafe, and 3% (n=1) too cautious; and that for WebMD was 70% (n=26) agree, 19% (n=7) unsafe, and 11% (n=4) too cautious. The unsafe triage rate for ChatGPT 3.5 (41%) was significantly higher (P=.009) than that of Ada (14%).
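The abstract reports P=.009 for the unsafe-triage comparison (15/37 for ChatGPT 3.5 vs 5/37 for Ada) without naming the test used. As one plausible reconstruction, a two-sided Fisher exact test on the reported 2x2 counts can be computed from first principles:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)

    def hyper(x):  # P(first row contains x of the col1 "successes")
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = hyper(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(hyper(x) for x in range(lo, hi + 1)
               if hyper(x) <= p_obs * (1 + 1e-9))

# Reported counts: 15/37 unsafe triage for ChatGPT 3.5 vs 5/37 for Ada.
p = fisher_exact_two_sided(15, 37 - 15, 5, 37 - 5)
print(round(p, 4))
```

The exact value depends on which test the authors ran (chi-square and Fisher exact give slightly different P values at these counts), but any standard choice puts the difference well below the .05 threshold, consistent with the reported significance.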

CONCLUSIONS

ChatGPT 3.5 had high diagnostic accuracy but a high unsafe triage rate. ChatGPT 4.0 had the poorest diagnostic accuracy, but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a64/10582809/381e84d8e7d8/mhealth_v11i1e49995_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a64/10582809/0671a37e37b5/mhealth_v11i1e49995_fig2.jpg

Similar Articles

1
Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study.
JMIR Mhealth Uhealth. 2023 Oct 3;11:e49995. doi: 10.2196/49995.
2
Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study.
JMIR Mhealth Uhealth. 2022 Sep 19;10(9):e38364. doi: 10.2196/38364.
3
Comparison of Two Symptom Checkers (Ada and Symptoma) in the Emergency Department: Randomized, Crossover, Head-to-Head, Double-Blinded Study.
J Med Internet Res. 2024 Aug 20;26:e56514. doi: 10.2196/56514.
4
Online symptom checker diagnostic and triage accuracy for HIV and hepatitis C.
Epidemiol Infect. 2019 Jan;147:e104. doi: 10.1017/S0950268819000268.
5
Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.
J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
6
ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis.
J Med Internet Res. 2024 Jul 8;26:e56110. doi: 10.2196/56110.
7
Mixed methods assessment of the influence of demographics on medical advice of ChatGPT.
J Am Med Inform Assoc. 2024 Sep 1;31(9):2002-2009. doi: 10.1093/jamia/ocae086.
8
A Symptom-Checker for Adult Patients Visiting an Interdisciplinary Emergency Care Center and the Safety of Patient Self-Triage: Real-Life Prospective Evaluation.
J Med Internet Res. 2024 Jun 27;26:e58157. doi: 10.2196/58157.
9
Young Adults' Perspectives on the Use of Symptom Checkers for Self-Triage and Self-Diagnosis: Qualitative Study.
JMIR Public Health Surveill. 2021 Jan 6;7(1):e22637. doi: 10.2196/22637.
10
Enhancing Triage Efficiency and Accuracy in Emergency Rooms for Patients with Metastatic Prostate Cancer: A Retrospective Analysis of Artificial Intelligence-Assisted Triage Using ChatGPT 4.0.
Cancers (Basel). 2023 Jul 22;15(14):3717. doi: 10.3390/cancers15143717.

Cited By

1
Explainable AI in medicine: challenges of integrating XAI into the future clinical routine.
Front Radiol. 2025 Aug 5;5:1627169. doi: 10.3389/fradi.2025.1627169. eCollection 2025.
2
Medical Expert Knowledge Meets AI to Enhance Symptom Checker Performance for Rare Disease Identification in Fabry Disease: Mixed Methods Study.
JMIR AI. 2025 Aug 28;4:e55001. doi: 10.2196/55001.
3
Clinical applications of large language models in medicine and surgery: A scoping review.

References

1
The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study.
Lancet Digit Health. 2024 Aug;6(8):e555-e561. doi: 10.1016/S2589-7500(24)00097-9.
2
Performance of ChatGPT as an AI-assisted decision support tool in medicine: a proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2).
Acta Cardiol. 2024 May;79(3):358-366. doi: 10.1080/00015385.2024.2303528. Epub 2024 Feb 13.
3
Clinical applications of large language models in medicine and surgery: A scoping review.
J Int Med Res. 2025 Jul;53(7):3000605251347556. doi: 10.1177/03000605251347556. Epub 2025 Jul 4.
4
ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings.
Arch Acad Emerg Med. 2025 Apr 5;13(1):e42. doi: 10.22037/aaemj.v13i1.2580. eCollection 2025.
5
Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses.
JAMA Netw Open. 2025 May 1;8(5):e2512994. doi: 10.1001/jamanetworkopen.2025.12994.
6
Extracting Multifaceted Characteristics of Patients With Chronic Disease Comorbidity: Framework Development Using Large Language Models.
JMIR Med Inform. 2025 May 15;13:e70096. doi: 10.2196/70096.
7
A Practical Guide to the Utilization of ChatGPT in the Emergency Department: A Systematic Review of Current Applications, Future Directions, and Limitations.
Cureus. 2025 Apr 6;17(4):e81802. doi: 10.7759/cureus.81802. eCollection 2025 Apr.
8
Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology.
NPJ Digit Med. 2025 May 3;8(1):242. doi: 10.1038/s41746-025-01589-z.
9
Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.
J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.
10
Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis.
JMIR Med Inform. 2025 Apr 25;13:e64963. doi: 10.2196/64963.
3
Large language model AI chatbots require approval as medical devices.
Nat Med. 2023 Oct;29(10):2396-2398. doi: 10.1038/s41591-023-02412-6.
4
Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care.
JMIR Med Educ. 2023 Apr 21;9:e46599. doi: 10.2196/46599.
5
Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios.
J Med Syst. 2023 Mar 4;47(1):33. doi: 10.1007/s10916-023-01925-4.
6
Artificial Hallucinations in ChatGPT: Implications in Scientific Writing.
Cureus. 2023 Feb 19;15(2):e35179. doi: 10.7759/cureus.35179. eCollection 2023 Feb.
7
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
8
Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study.
JMIR Mhealth Uhealth. 2022 Sep 19;10(9):e38364. doi: 10.2196/38364.
9
Triage Accuracy of Symptom Checker Apps: 5-Year Follow-up Evaluation.
J Med Internet Res. 2022 May 10;24(5):e31810. doi: 10.2196/31810.
10
Safety of Triage Self-assessment Using a Symptom Assessment App for Walk-in Patients in the Emergency Care Setting: Observational Prospective Cross-sectional Study.
JMIR Mhealth Uhealth. 2022 Mar 28;10(3):e32340. doi: 10.2196/32340.