ChatGPT4's diagnostic accuracy in inpatient neurology: A retrospective cohort study.

Author information

Cano-Besquet Sebastian, Rice-Canetto Tyler, Abou-El-Hassan Hadi, Alarcon Simon, Zimmerman Jason, Issagholian Leo, Salomon Nasser, Rojas Ivan, Dhahbi Joseph, Neeki Michael M

Affiliations

California University of Science and Medicine, 1501 Violet St, Colton, CA, 92324, USA.

Department of Emergency Medicine, Arrowhead Regional Medical Center, 400 N. Pepper Ave, Colton, CA, 92324, USA.

Publication information

Heliyon. 2024 Dec 9;10(24):e40964. doi: 10.1016/j.heliyon.2024.e40964. eCollection 2024 Dec 30.

Abstract

BACKGROUND

Large language models (LLMs) such as ChatGPT-4 (CG4) are proving to be valuable tools in the medical field, not only in facilitating administrative tasks but also in augmenting medical decision-making. LLMs have previously been tested for diagnostic accuracy with expert-generated questions and standardized test data. Among those studies, CG4 consistently outperformed alternative LLMs, including ChatGPT-3.5 (no longer publicly available) and Google Bard (now known as "Google Gemini"). The next logical step was to explore CG4's accuracy within a specific clinical domain. Our study evaluated the diagnostic accuracy of CG4 within an inpatient neurology consultation service.

METHODS

We reviewed all patients listed on the daily neurology consultation roster at Arrowhead Regional Medical Center in Colton, CA, on each day surveyed until we reached a total of 51 patients, ensuring a complete and representative sample of the patient population. ChatGPT-4, using HIPAA-compliant methodology, received patient data from the Epic EHR as input and was asked to provide an initial differential diagnosis list, investigations and recommended actions, a final diagnosis, and a treatment plan for each patient. A comprehensiveness scale (an ordinal scale from 0 to 3) was then used to rate how closely the consultants' and CG4's initial diagnoses matched the consultants' final diagnoses. In this proof-of-concept study, we assumed that the neurology consultants' final diagnoses were accurate. We employed non-parametric bootstrap resampling to construct 95 % confidence intervals around mean scores, and used Fisher's Exact test, the Wilcoxon Rank Sum test, and ordinal logistic regression models to compare performance between the consultant and CG4 groups.
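As an illustration only (not the authors' code), the Python sketch below shows how this analysis pipeline could look: a non-parametric bootstrap CI for the mean comprehensiveness score, Fisher's Exact test on comprehensive vs. non-comprehensive counts, a Wilcoxon rank-sum comparison, and an ordinal logistic regression with rater group as the predictor. The score arrays are hypothetical placeholders, not the study data.

```python
# Illustrative sketch of the described statistics; the score arrays are
# hypothetical placeholders, NOT the study data.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 51  # patients per group, matching the study's sample size
consultant = rng.choice([0, 1, 2, 3], size=n, p=[0.02, 0.04, 0.10, 0.84])
cg4 = rng.choice([0, 1, 2, 3], size=n, p=[0.02, 0.02, 0.35, 0.61])

def bootstrap_ci(scores, n_boot=10_000):
    """Non-parametric bootstrap 95 % CI for the mean score."""
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

print("Consultant mean, 95 % CI:", consultant.mean(), bootstrap_ci(consultant))
print("CG4 mean, 95 % CI:", cg4.mean(), bootstrap_ci(cg4))

# Fisher's Exact test on comprehensive (score >= 2) vs. not comprehensive.
table = [[(consultant >= 2).sum(), (consultant < 2).sum()],
         [(cg4 >= 2).sum(), (cg4 < 2).sum()]]
print("Fisher's Exact:", stats.fisher_exact(table))

# Wilcoxon Rank Sum test on the ordinal scores themselves.
print("Rank sum:", stats.ranksums(consultant, cg4))

# Ordinal logistic regression: does rater group predict higher scores?
df = pd.DataFrame({"score": np.concatenate([consultant, cg4]),
                   "is_consultant": [1] * n + [0] * n})
fit = OrderedModel(df["score"], df[["is_consultant"]], distr="logit").fit(
    method="bfgs", disp=False)
print("OR for consultant group:", np.exp(fit.params["is_consultant"]))
```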

FINDINGS

Our study found that CG4 demonstrated diagnostic accuracy comparable to that of consultant neurologists. The most frequent comprehensiveness score achieved by both groups was "3," with consultant neurologists achieving this score 43 times and CG4 achieving it 31 times. The mean comprehensiveness scores were 2.75 (95 % CI: 2.49-2.90) for the consultant group and 2.57 (95 % CI: 2.31-2.67) for the CG4 group. The success rate for comprehensive diagnoses (a score of "2" or "3") was 94.1 % (95 % CI: 84.1 %-98.0 %) for consultants and 96.1 % (95 % CI: 86.8 %-98.9 %) for CG4, with no statistically significant difference in success rates (p = 1.00). The Wilcoxon Rank Sum Test indicated that the consultant group had a higher likelihood of providing more comprehensive diagnoses (W = 1583, p = 0.02). Ordinal logistic regression models identified significant predictors of diagnostic accuracy, with the consultant diagnosis group showing an odds ratio of 3.68 (95 % CI: 1.28-10.55) for higher value outcomes. Notably, integrating CG4's initial diagnoses with those from consultants could achieve comprehensive diagnostics in all cases, indicating a number needed to treat (NNT) of 17 to attain one additional comprehensive diagnosis.
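The NNT here follows from simple arithmetic. A quick sanity check, assuming the counts implied by the reported percentages (48 of 51 cases comprehensive for consultants alone, 51 of 51 once CG4's diagnoses are added):

```python
# Sanity check of the reported NNT, assuming the implied counts
# (48/51 comprehensive for consultants alone; 51/51 combined).
n = 51
consultant_rate = 48 / n                          # ~0.941, the reported 94.1 %
combined_rate = 51 / n                            # 1.000 when CG4 is added
absolute_gain = combined_rate - consultant_rate   # ~0.059
nnt = 1 / absolute_gain                           # cases per extra comprehensive dx
print(f"{consultant_rate:.1%} -> {combined_rate:.1%}, NNT = {nnt:.0f}")
# prints: 94.1% -> 100.0%, NNT = 17
```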

INTERPRETATION

Our findings suggest that CG4 can serve as a valuable diagnostic tool within the domain of inpatient neurology, providing comprehensive and accurate initial diagnoses comparable to those of consultant neurologists. CG4 might contribute to better patient outcomes by serving as an aid in diagnosis and treatment recommendations, potentially leading to fewer missed diagnoses and quicker diagnostic processes. Continuous strategies and evaluations to improve LLMs' accuracy remain crucial. Further studies with larger sample sizes and independent third-party evaluations are recommended to confirm these findings and assess the impact of LLMs on patient health.

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00b2/11699242/f4d8cb994dfa/ga1.jpg
