Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan.
Urasoe General Hospital, Urasoe, Okinawa, Japan.
BMJ Open Qual. 2024 Jun 3;13(2):e002654. doi: 10.1136/bmjoq-2023-002654.
Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, this approach requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text when given suitable prompts. ChatGPT could therefore potentially assist manual chart review in detecting diagnostic errors.
This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations.
We analysed 545 published case reports that included diagnostic errors. We entered the text of each case presentation and final diagnosis, together with original prompts, into ChatGPT (GPT-4) to generate responses, including a judgement on whether a diagnostic error occurred and the factors contributing to it. Factors contributing to diagnostic errors were coded according to three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). ChatGPT's responses on contributing factors were compared with those of physicians.
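The study's actual prompts are not reproduced in this abstract. As an illustration only, the following minimal Python sketch shows how a case presentation and final diagnosis might be submitted to GPT-4 through the OpenAI chat completions API; the prompt wording, the review_case function and the request parameters are assumptions, not the authors' method.

# Hypothetical sketch only: the prompts used in the study are not shown here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_case(case_presentation: str, final_diagnosis: str) -> str:
    """Ask GPT-4 whether the case contains a diagnostic error and which
    DEER/RDC/GDP factors may have contributed (illustrative prompt)."""
    prompt = (
        "You are reviewing a published case report.\n"
        f"Case presentation:\n{case_presentation}\n\n"
        f"Final diagnosis: {final_diagnosis}\n\n"
        "1. State whether a diagnostic error occurred (yes/no).\n"
        "2. List contributing factors, coded with the DEER, RDC and GDP taxonomies."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # low-variance output to make coding more reproducible
    )
    return response.choices[0].message.content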
ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded significantly more contributing factors per case than physicians did: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The contributing factors most frequently coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC, and 'atypical presentation' (264, 48.4%) in GDP.
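The abstract does not name the statistical test behind these p values. As a hedged illustration only, paired per-case factor counts from ChatGPT and physicians could be compared with a Wilcoxon signed-rank test; the data below are invented for demonstration.

# Illustration only: hypothetical per-case DEER factor counts, not study data.
import numpy as np
from scipy.stats import wilcoxon

chatgpt_counts = np.array([5, 6, 4, 7, 5])    # hypothetical counts coded by ChatGPT
physician_counts = np.array([1, 2, 1, 1, 2])  # hypothetical counts from physician review

stat, p_value = wilcoxon(chatgpt_counts, physician_counts)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p_value:.4f}")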
ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual review in detecting factors contributing to diagnostic errors, especially 'atypical presentation'.