

Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study.

Affiliations

Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States.

Department of Health Services, Policy and Practice, Brown University School of Public Health, Providence, RI, United States.

Publication

JMIR Mhealth Uhealth. 2023 Oct 3;11:e49995. doi: 10.2196/49995.

DOI:10.2196/49995
PMID:37788063
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10582809/
Abstract

BACKGROUND

Diagnosis is a core component of effective health care, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can play a role in improving diagnosis by physicians and other health care workers. Symptom checkers (SCs) have been designed to improve diagnosis and triage (ie, which level of care to seek) by patients.

OBJECTIVE

The aim of this study was to evaluate the performance of the new large language model ChatGPT (versions 3.5 and 4.0), the widely used WebMD SC, and an SC developed by Ada Health in the diagnosis and triage of patients with urgent or emergent clinical problems compared with the final emergency department (ED) diagnoses and physician reviews.

METHODS

We used previously collected, deidentified, self-report data from 40 patients presenting to an ED for care who used the Ada SC to record their symptoms prior to seeing the ED physician. Deidentified data were entered into ChatGPT versions 3.5 and 4.0 and WebMD by a research assistant blinded to diagnoses and triage. Diagnoses from all 4 systems were compared with the previously abstracted final diagnoses in the ED as well as with diagnoses and triage recommendations from three independent board-certified ED physicians who had blindly reviewed the self-report clinical data from Ada. Diagnostic accuracy was calculated as the proportion of the diagnoses from ChatGPT, Ada SC, WebMD SC, and the independent physicians that matched at least one ED diagnosis (stratified as top 1 or top 3). Triage accuracy was calculated as the number of recommendations from ChatGPT, WebMD, or Ada that agreed with at least 2 of the independent physicians or were rated "unsafe" or "too cautious."
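The accuracy definitions above are computable, and can be sketched as a small function. The per-case data below are hypothetical (the study's case records are not reproduced in this abstract); the sketch only illustrates the top-k matching rule: a case counts as a match if any of a system's top-k ranked diagnoses appears among the final ED diagnoses.

```python
def topk_match_rate(predictions, ed_diagnoses, k):
    """Proportion of cases where any of the top-k predicted diagnoses
    matches at least one final emergency department (ED) diagnosis."""
    matches = 0
    for preds, truth in zip(predictions, ed_diagnoses):
        if any(p in truth for p in preds[:k]):
            matches += 1
    return matches / len(predictions)

# Hypothetical example: 3 cases, each with a ranked list of predicted
# diagnoses and a set of final ED diagnoses.
preds = [
    ["appendicitis", "gastroenteritis", "UTI"],
    ["migraine", "tension headache", "sinusitis"],
    ["angina", "GERD", "costochondritis"],
]
truth = [
    {"appendicitis"},
    {"subarachnoid hemorrhage"},
    {"GERD"},
]

print(topk_match_rate(preds, truth, 1))  # case 1 matches on top-1 -> 1/3
print(topk_match_rate(preds, truth, 3))  # cases 1 and 3 match within top-3 -> 2/3
```

Triage accuracy follows the same counting pattern, except that the reference standard is agreement with at least 2 of the 3 independent physicians rather than the ED diagnosis list.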

RESULTS

Overall, 30 and 37 cases had sufficient data for diagnostic and triage analysis, respectively. The rate of top-1 diagnosis matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 9 (30%), 12 (40%), 10 (33%), and 12 (40%), respectively, with a mean rate of 47% for the physicians. The rate of top-3 diagnostic matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 19 (63%), 19 (63%), 15 (50%), and 17 (57%), respectively, with a mean rate of 69% for physicians. The distribution of triage results for Ada was 62% (n=23) agree, 14% unsafe (n=5), and 24% (n=9) too cautious; that for ChatGPT 3.5 was 59% (n=22) agree, 41% (n=15) unsafe, and 0% (n=0) too cautious; that for ChatGPT 4.0 was 76% (n=28) agree, 22% (n=8) unsafe, and 3% (n=1) too cautious; and that for WebMD was 70% (n=26) agree, 19% (n=7) unsafe, and 11% (n=4) too cautious. The unsafe triage rate for ChatGPT 3.5 (41%) was significantly higher (P=.009) than that of Ada (14%).
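The abstract reports P=.009 for the unsafe-triage comparison (15/37 for ChatGPT 3.5 vs 5/37 for Ada) without naming the test used. As one plausible reconstruction, a two-sided Fisher exact test on the reported 2x2 counts can be computed from first principles:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)

    def hyper(x):  # P(first row contains x of the col1 "successes")
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = hyper(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(hyper(x) for x in range(lo, hi + 1)
               if hyper(x) <= p_obs * (1 + 1e-9))

# Reported counts: 15/37 unsafe triage for ChatGPT 3.5 vs 5/37 for Ada.
p = fisher_exact_two_sided(15, 37 - 15, 5, 37 - 5)
print(round(p, 4))
```

The exact value depends on which test the authors ran (chi-square and Fisher exact give slightly different P values at these counts), but any standard choice puts the difference well below the .05 threshold, consistent with the reported significance.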

CONCLUSIONS

ChatGPT 3.5 had high diagnostic accuracy but a high unsafe triage rate. ChatGPT 4.0 had the poorest diagnostic accuracy, but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a64/10582809/381e84d8e7d8/mhealth_v11i1e49995_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a64/10582809/0671a37e37b5/mhealth_v11i1e49995_fig2.jpg

Similar Articles

1
Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study.
JMIR Mhealth Uhealth. 2023 Oct 3;11:e49995. doi: 10.2196/49995.
2
Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study.
JMIR Mhealth Uhealth. 2022 Sep 19;10(9):e38364. doi: 10.2196/38364.
3
Comparison of Two Symptom Checkers (Ada and Symptoma) in the Emergency Department: Randomized, Crossover, Head-to-Head, Double-Blinded Study.
J Med Internet Res. 2024 Aug 20;26:e56514. doi: 10.2196/56514.
4
Online symptom checker diagnostic and triage accuracy for HIV and hepatitis C.
Epidemiol Infect. 2019 Jan;147:e104. doi: 10.1017/S0950268819000268.
5
Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.
J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
6
ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis.
J Med Internet Res. 2024 Jul 8;26:e56110. doi: 10.2196/56110.
7
Mixed methods assessment of the influence of demographics on medical advice of ChatGPT.
J Am Med Inform Assoc. 2024 Sep 1;31(9):2002-2009. doi: 10.1093/jamia/ocae086.
8
A Symptom-Checker for Adult Patients Visiting an Interdisciplinary Emergency Care Center and the Safety of Patient Self-Triage: Real-Life Prospective Evaluation.
J Med Internet Res. 2024 Jun 27;26:e58157. doi: 10.2196/58157.
9
Young Adults' Perspectives on the Use of Symptom Checkers for Self-Triage and Self-Diagnosis: Qualitative Study.
JMIR Public Health Surveill. 2021 Jan 6;7(1):e22637. doi: 10.2196/22637.
10
Enhancing Triage Efficiency and Accuracy in Emergency Rooms for Patients with Metastatic Prostate Cancer: A Retrospective Analysis of Artificial Intelligence-Assisted Triage Using ChatGPT 4.0.
Cancers (Basel). 2023 Jul 22;15(14):3717. doi: 10.3390/cancers15143717.

Cited By

1
Explainable AI in medicine: challenges of integrating XAI into the future clinical routine.
Front Radiol. 2025 Aug 5;5:1627169. doi: 10.3389/fradi.2025.1627169. eCollection 2025.
2
Medical Expert Knowledge Meets AI to Enhance Symptom Checker Performance for Rare Disease Identification in Fabry Disease: Mixed Methods Study.
JMIR AI. 2025 Aug 28;4:e55001. doi: 10.2196/55001.
3
Clinical applications of large language models in medicine and surgery: A scoping review.

References

1
The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study.
Lancet Digit Health. 2024 Aug;6(8):e555-e561. doi: 10.1016/S2589-7500(24)00097-9.
2
Performance of ChatGPT as an AI-assisted decision support tool in medicine: a proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2).
Acta Cardiol. 2024 May;79(3):358-366. doi: 10.1080/00015385.2024.2303528. Epub 2024 Feb 13.
3
Clinical applications of large language models in medicine and surgery: A scoping review.
J Int Med Res. 2025 Jul;53(7):3000605251347556. doi: 10.1177/03000605251347556. Epub 2025 Jul 4.
4
ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings.
Arch Acad Emerg Med. 2025 Apr 5;13(1):e42. doi: 10.22037/aaemj.v13i1.2580. eCollection 2025.
5
Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses.
JAMA Netw Open. 2025 May 1;8(5):e2512994. doi: 10.1001/jamanetworkopen.2025.12994.
6
Extracting Multifaceted Characteristics of Patients With Chronic Disease Comorbidity: Framework Development Using Large Language Models.
JMIR Med Inform. 2025 May 15;13:e70096. doi: 10.2196/70096.
7
A Practical Guide to the Utilization of ChatGPT in the Emergency Department: A Systematic Review of Current Applications, Future Directions, and Limitations.
Cureus. 2025 Apr 6;17(4):e81802. doi: 10.7759/cureus.81802. eCollection 2025 Apr.
8
Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology.
NPJ Digit Med. 2025 May 3;8(1):242. doi: 10.1038/s41746-025-01589-z.
9
Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.
J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.
10
Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis.
JMIR Med Inform. 2025 Apr 25;13:e64963. doi: 10.2196/64963.
3
Large language model AI chatbots require approval as medical devices.
Nat Med. 2023 Oct;29(10):2396-2398. doi: 10.1038/s41591-023-02412-6.
4
Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care.
JMIR Med Educ. 2023 Apr 21;9:e46599. doi: 10.2196/46599.
5
Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios.
J Med Syst. 2023 Mar 4;47(1):33. doi: 10.1007/s10916-023-01925-4.
6
Artificial Hallucinations in ChatGPT: Implications in Scientific Writing.
Cureus. 2023 Feb 19;15(2):e35179. doi: 10.7759/cureus.35179. eCollection 2023 Feb.
7
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
8
Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study.
JMIR Mhealth Uhealth. 2022 Sep 19;10(9):e38364. doi: 10.2196/38364.
9
Triage Accuracy of Symptom Checker Apps: 5-Year Follow-up Evaluation.
J Med Internet Res. 2022 May 10;24(5):e31810. doi: 10.2196/31810.
10
Safety of Triage Self-assessment Using a Symptom Assessment App for Walk-in Patients in the Emergency Care Setting: Observational Prospective Cross-sectional Study.
JMIR Mhealth Uhealth. 2022 Mar 28;10(3):e32340. doi: 10.2196/32340.