Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States.
Department of Health Services, Policy and Practice, Brown University School of Public Health, Providence, RI, United States.
JMIR Mhealth Uhealth. 2023 Oct 3;11:e49995. doi: 10.2196/49995.
Diagnosis is a core component of effective health care, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can help improve diagnosis by physicians and other health care workers. Symptom checkers (SCs) are designed to improve diagnosis and triage (ie, deciding which level of care to seek) by patients themselves.
The aim of this study was to evaluate the performance of the new large language model ChatGPT (versions 3.5 and 4.0), the widely used WebMD SC, and an SC developed by Ada Health in the diagnosis and triage of patients with urgent or emergent clinical problems compared with the final emergency department (ED) diagnoses and physician reviews.
We used previously collected, deidentified, self-report data from 40 patients presenting to an ED for care who used the Ada SC to record their symptoms prior to seeing the ED physician. The deidentified data were entered into ChatGPT versions 3.5 and 4.0 and into WebMD by a research assistant blinded to the diagnoses and triage outcomes. Diagnoses from all 4 systems were compared with the previously abstracted final ED diagnoses as well as with diagnoses and triage recommendations from 3 independent board-certified ED physicians who had blindly reviewed the self-report clinical data from Ada. Diagnostic accuracy was calculated as the proportion of the diagnoses from ChatGPT, Ada SC, WebMD SC, and the independent physicians that matched at least one ED diagnosis (stratified as top 1 or top 3). Triage accuracy was calculated as the proportion of recommendations from ChatGPT, WebMD, or Ada that agreed with at least 2 of the 3 independent physicians; the remaining recommendations were rated "unsafe" or "too cautious."
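The top-k diagnostic match metric described above can be sketched as a short function. This is an illustrative reconstruction, not the study's actual matching procedure (the study used adjudicated matching against the ED diagnoses, not exact string comparison); the variable names and the exact-string, case-insensitive match are assumptions for the sketch.

```python
def topk_match_rate(predictions, ed_diagnoses, k):
    """Fraction of cases in which any of the top-k predicted diagnoses
    matches at least one final ED diagnosis.

    predictions: list of ranked diagnosis lists, one per case
    ed_diagnoses: list of final ED diagnosis lists, one per case
    Matching here is exact string comparison, case-insensitive --
    a simplification of the study's physician-adjudicated matching.
    """
    hits = 0
    for preds, truth in zip(predictions, ed_diagnoses):
        top_k = {p.lower() for p in preds[:k]}
        if top_k & {t.lower() for t in truth}:
            hits += 1
    return hits / len(predictions)


# Hypothetical two-case example: the first system lists the correct
# diagnosis first; the second lists it third.
preds = [["appendicitis", "gastroenteritis", "UTI"],
         ["migraine", "tension headache", "sinusitis"]]
truth = [["Appendicitis"], ["Sinusitis"]]
print(topk_match_rate(preds, truth, 1))  # top-1 match rate
print(topk_match_rate(preds, truth, 3))  # top-3 match rate
```

With the hypothetical data above, the top-1 rate is 0.5 and the top-3 rate is 1.0, mirroring how top-3 accuracy can exceed top-1 accuracy in the results that follow.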
Overall, 30 and 37 cases had sufficient data for diagnostic and triage analysis, respectively. The rate of top-1 diagnostic matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 9 (30%), 12 (40%), 10 (33%), and 12 (40%), respectively, with a mean rate of 47% for the physicians. The rate of top-3 diagnostic matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 19 (63%), 19 (63%), 15 (50%), and 17 (57%), respectively, with a mean rate of 69% for the physicians. The distribution of triage results for Ada was 62% (n=23) agree, 14% (n=5) unsafe, and 24% (n=9) too cautious; that for ChatGPT 3.5 was 59% (n=22) agree, 41% (n=15) unsafe, and 0% (n=0) too cautious; that for ChatGPT 4.0 was 76% (n=28) agree, 22% (n=8) unsafe, and 3% (n=1) too cautious; and that for WebMD was 70% (n=26) agree, 19% (n=7) unsafe, and 11% (n=4) too cautious. The unsafe triage rate for ChatGPT 3.5 (41%) was significantly higher (P=.009) than that of Ada (14%).
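As a quick arithmetic check, the triage percentages above follow directly from the stated counts over the 37 triage-analyzable cases, rounded to whole percents as in the abstract:

```python
# Recompute the reported triage percentages from the stated counts (N=37).
def pct(n, total=37):
    return round(100 * n / total)

# Counts per system: (agree, unsafe, too cautious), as reported above.
triage_counts = {
    "Ada": (23, 5, 9),
    "ChatGPT 3.5": (22, 15, 0),
    "ChatGPT 4.0": (28, 8, 1),
    "WebMD": (26, 7, 4),
}
for system, counts in triage_counts.items():
    print(system, [pct(n) for n in counts])
```

This reproduces 62/14/24 for Ada, 59/41/0 for ChatGPT 3.5, 76/22/3 for ChatGPT 4.0, and 70/19/11 for WebMD; ChatGPT 4.0's row sums to 101% due to rounding.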
ChatGPT 3.5 had high diagnostic accuracy but a high unsafe triage rate. ChatGPT 4.0 had the poorest diagnostic accuracy, but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.