Division of Rheumatology and Systemic Inflammatory Diseases, University Hospital Hamburg-Eppendorf (UKE), Hamburg, Germany.
Epidemiology Unit, German Rheumatism Research Centre, Berlin, Germany.
Rheumatol Int. 2024 Feb;44(2):303-306. doi: 10.1007/s00296-023-05464-6. Epub 2023 Sep 24.
Pre-clinical studies suggest that large language models (e.g., ChatGPT) could be used in the diagnostic process to distinguish inflammatory rheumatic diseases (IRD) from other diseases. We therefore aimed to assess the diagnostic accuracy of ChatGPT-4 in comparison to rheumatologists. For the analysis, the data set of Gräf et al. (2022) was used. Previous patient assessments were analyzed using ChatGPT-4 and compared to rheumatologists' assessments. ChatGPT-4 listed the correct diagnosis as the top diagnosis about as often as rheumatologists (35% vs 39%, p = 0.30), and likewise among the top 3 diagnoses (60% vs 55%, p = 0.38). In IRD-positive cases, ChatGPT-4 provided the correct top diagnosis in 71% of cases vs 62% for the rheumatologists; the correct diagnosis was among the top 3 in 86% (ChatGPT-4) vs 74% (rheumatologists). In non-IRD cases, ChatGPT-4 provided the correct top diagnosis in 15% of cases vs 27% for the rheumatologists; the correct diagnosis was among the top 3 in 46% of the ChatGPT-4 group vs 45% of the rheumatologist group. If only the first suggested diagnosis was considered, ChatGPT-4 correctly classified 58% of cases as IRD compared with 56% for the rheumatologists (p = 0.52). ChatGPT-4 showed slightly higher accuracy for the top 3 overall diagnoses than the rheumatologists' assessment. ChatGPT-4 was able to provide the correct differential diagnosis in a relevant number of cases and achieved higher sensitivity for detecting IRDs than rheumatologists, at the cost of lower specificity. These pilot results highlight the potential of this new technology as a triage tool for the diagnosis of IRD.