

Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study.

Authors

Harada Yukinori, Sakamoto Tetsu, Sugimoto Shu, Shimizu Taro

Affiliations

Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan.

Department of General Medicine, Nagano Chuo Hospital, Nagano, Japan.

Publication

JMIR Form Res. 2024 May 17;8:e53985. doi: 10.2196/53985.

DOI: 10.2196/53985
PMID: 38758588
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11143391/
Abstract

BACKGROUND

Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited.

OBJECTIVE

This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world.

METHODS

This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We included only patients who underwent an AI-based symptom check at the index visit and whose diagnosis was confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the presence of the final diagnosis in the list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: May 1, 2019, to April 30, 2020 (first year); May 1, 2020, to April 30, 2021 (second year); and May 1, 2021, to April 30, 2022 (third year).
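The primary outcome is a "top-10 hit": the confirmed final diagnosis appears in the symptom checker's 10-entry differential list. A minimal sketch of that definition (the function name and the disease strings are illustrative, not from the study):

```python
def top10_hit(differential: list[str], final_diagnosis: str) -> bool:
    """Primary outcome: is the confirmed final diagnosis among the
    first 10 entries of the symptom checker's differential list?"""
    return final_diagnosis in differential[:10]

# Hypothetical example; diagnosis names are illustrative only.
ddx = ["acute appendicitis", "gastroenteritis", "diverticulitis"]
print(top10_hit(ddx, "diverticulitis"))  # True
```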

RESULTS

A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. Overall, the accuracy of the differential diagnosis list created by the AI-based symptom checker was 172/381 (45.1%), which did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list created by the symptom checker.
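The chi-square comparison above can be reproduced from the published counts alone. The following standard-library sketch rebuilds the 2×3 table of hits vs misses by year from the reported fractions and recovers P=.85; it also computes the crude odds ratio for common disease, which is unadjusted and therefore only approximates the reported multivariate OR of 4.13:

```python
import math

# Reported counts of top-10 hits per study year (97/219, 32/72, 43/90).
hits = [97, 32, 43]
totals = [219, 72, 90]
misses = [t - h for h, t in zip(hits, totals)]

# Pearson chi-square test of independence on the 2x3 table (hit/miss x year).
grand = sum(totals)
hit_total = sum(hits)
chi2 = 0.0
for h, m, t in zip(hits, misses, totals):
    for obs, row_total in ((h, hit_total), (m, grand - hit_total)):
        exp = t * row_total / grand
        chi2 += (obs - exp) ** 2 / exp

# With df = (2-1)*(3-1) = 2, the chi-square survival function is exactly exp(-x/2).
p = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.3f}, p = {p:.2f}")  # p = 0.85, matching the abstract

# Crude odds ratio for common vs uncommon disease, from reported counts:
# 142/257 hits for common disease (172 total hits minus 30 uncommon hits),
# 30/124 for uncommon.
a, b = 142, 257 - 142   # common: hit, miss
c, d = 30, 124 - 30     # uncommon: hit, miss
or_crude = (a * d) / (b * c)
print(f"crude OR = {or_crude:.2f}")  # ~3.87, vs adjusted OR 4.13
```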

CONCLUSIONS

A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, which has been implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with a lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2113/11143391/d27258e30418/formative_v8i1e53985_fig1.jpg

Similar Articles

1
Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study.
JMIR Form Res. 2024 May 17;8:e53985. doi: 10.2196/53985.
2
Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study.
JMIR AI. 2024 Apr 29;3:e46875. doi: 10.2196/46875.
3
Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study.
JMIR Mhealth Uhealth. 2022 Sep 19;10(9):e38364. doi: 10.2196/38364.
4
Online symptom checker diagnostic and triage accuracy for HIV and hepatitis C.
Epidemiol Infect. 2019 Jan;147:e104. doi: 10.1017/S0950268819000268.
5
Accuracy of a Popular Online Symptom Checker for Ophthalmic Diagnoses.
JAMA Ophthalmol. 2019 Jun 1;137(6):690-692. doi: 10.1001/jamaophthalmol.2019.0571.
6
Determinants of Laypersons' Trust in Medical Decision Aids: Randomized Controlled Trial.
JMIR Hum Factors. 2022 May 3;9(2):e35219. doi: 10.2196/35219.
7
Patient Perspectives on the Usefulness of an Artificial Intelligence-Assisted Symptom Checker: Cross-Sectional Survey Study.
J Med Internet Res. 2020 Jan 30;22(1):e14679. doi: 10.2196/14679.
8
Comparison of physician and artificial intelligence-based symptom checker diagnostic accuracy.
Rheumatol Int. 2022 Dec;42(12):2167-2176. doi: 10.1007/s00296-022-05202-4. Epub 2022 Sep 10.
9
Assessment of a Digital Symptom Checker Tool's Accuracy in Suggesting Reproductive Health Conditions: Clinical Vignettes Study.
JMIR Mhealth Uhealth. 2023 Dec 5;11:e46718. doi: 10.2196/46718.
10
Incidence of Diagnostic Errors Among Unexpectedly Hospitalized Patients Using an Automated Medical History-Taking System With a Differential Diagnosis Generator: Retrospective Observational Study.
JMIR Med Inform. 2022 Jan 27;10(1):e35225. doi: 10.2196/35225.

Cited By

1
Medical Expert Knowledge Meets AI to Enhance Symptom Checker Performance for Rare Disease Identification in Fabry Disease: Mixed Methods Study.
JMIR AI. 2025 Aug 28;4:e55001. doi: 10.2196/55001.
2
Comparative Analysis of Diagnostic Performance: Differential Diagnosis Lists by LLaMA3 Versus LLaMA2 for Case Reports.
JMIR Form Res. 2024 Nov 19;8:e64844. doi: 10.2196/64844.
3
Comparative Study to Evaluate the Accuracy of Differential Diagnosis Lists Generated by Gemini Advanced, Gemini, and Bard for a Case Report Series Analysis: Cross-Sectional Study.
JMIR Med Inform. 2024 Oct 2;12:e63010. doi: 10.2196/63010.
4
Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools.
medRxiv. 2024 Nov 7:2024.07.22.24310816. doi: 10.1101/2024.07.22.24310816.

References

1
Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks.
Nat Commun. 2024 Mar 6;15(1):2050. doi: 10.1038/s41467-024-46411-8.
2
Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties.
J Chin Med Assoc. 2024 Mar 1;87(3):259-260. doi: 10.1097/JCMA.0000000000001064. Epub 2024 Feb 2.
3
Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases.
J Am Med Inform Assoc. 2024 Sep 1;31(9):2084-2088. doi: 10.1093/jamia/ocad245.
4
Will Generative Artificial Intelligence Deliver on Its Promise in Health Care?
JAMA. 2024 Jan 2;331(1):65-69. doi: 10.1001/jama.2023.25054.
5
ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation.
JMIR Med Inform. 2023 Oct 9;11:e48808. doi: 10.2196/48808.
6
Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study.
JMIR Mhealth Uhealth. 2023 Oct 3;11:e49995. doi: 10.2196/49995.
7
Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge.
JAMA. 2023 Jul 3;330(1):78-80. doi: 10.1001/jama.2023.8288.
8
Harnessing the Promise of Artificial Intelligence Responsibly.
JAMA. 2023 Apr 25;329(16):1347-1348. doi: 10.1001/jama.2023.2771.
9
Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study.
Int J Environ Res Public Health. 2023 Feb 15;20(4):3378. doi: 10.3390/ijerph20043378.
10
Effect of contextual factors on the prevalence of diagnostic errors among patients managed by physicians of the same specialty: a single-centre retrospective observational study.
BMJ Qual Saf. 2024 May 17;33(6):386-394. doi: 10.1136/bmjqs-2022-015436.