From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.).
Radiology. 2024 Mar;310(3):e232255. doi: 10.1148/radiol.232255.
Background Large language models (LLMs) hold substantial promise for medical imaging interpretation. However, there is a lack of studies on their feasibility in handling reasoning questions associated with medical diagnosis. Purpose To investigate the viability of leveraging three publicly available LLMs to enhance consistency and diagnostic accuracy in medical imaging based on standardized reporting, with pathology as the reference standard. Materials and Methods US images of thyroid nodules with pathologic results were retrospectively collected from a tertiary referral hospital between July 2022 and December 2022 and used to evaluate malignancy diagnoses generated by three LLMs: OpenAI's ChatGPT 3.5, ChatGPT 4.0, and Google's Bard. Inter- and intra-LLM diagnostic agreement was evaluated. Then, diagnostic performance, including accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), was evaluated and compared for the LLMs and three interactive approaches: a human reader combined with LLMs, an image-to-text model combined with LLMs, and an end-to-end convolutional neural network model. Results A total of 1161 US images of thyroid nodules (498 benign, 663 malignant) from 725 patients (mean age, 42.2 years ± 14.1 [SD]; 516 women) were evaluated. ChatGPT 4.0 and Bard displayed substantial to almost perfect intra-LLM agreement (κ range, 0.65-0.86 [95% CI: 0.64, 0.86]), whereas ChatGPT 3.5 showed fair to substantial agreement (κ range, 0.36-0.68 [95% CI: 0.36, 0.68]). ChatGPT 4.0 had an accuracy of 78%-86% (95% CI: 76%, 88%) and a sensitivity of 86%-95% (95% CI: 83%, 96%), compared with 74%-86% (95% CI: 71%, 88%) and 74%-91% (95% CI: 71%, 93%), respectively, for Bard.
Moreover, with ChatGPT 4.0, the image-to-text-LLM strategy exhibited an AUC (0.83 [95% CI: 0.80, 0.85]) and accuracy (84% [95% CI: 82%, 86%]) comparable to those of the human-LLM interaction strategy involving two senior readers and one junior reader, and exceeding those of the human-LLM interaction strategy involving one junior reader alone. Conclusion LLMs, particularly when integrated with image-to-text approaches, show potential for enhancing diagnostic medical imaging. ChatGPT 4.0 was optimal for consistency and diagnostic accuracy compared with Bard and ChatGPT 3.5. © RSNA, 2024