Alyanak Bülent, Çakar İsa, Dede Burak Tayyip, Yıldızgören Mustafa Turgut, Bağcıer Fatih
Gölcük Necati Çelik State Hospital, Department of Physical Medicine and Rehabilitation, Kocaeli, Turkey.
Basaksehir Cam Sakura City Hospital, Department of Radiology, Istanbul, Turkey.
Int J Med Inform. 2025 Nov;203:105999. doi: 10.1016/j.ijmedinf.2025.105999. Epub 2025 Jun 3.
This study aimed to evaluate the reliability of plantar fascia thickness measurements made by ChatGPT-4 on magnetic resonance imaging (MRI) images, compared with those obtained by an experienced clinician.
In this retrospective, single-center study, foot MRI images from the hospital archive were analysed. Plantar fascia thickness was measured under both blinded and non-blinded conditions by an experienced clinician and ChatGPT-4 at two separate time points. Measurement reliability was assessed using the intraclass correlation coefficient (ICC), mean absolute error (MAE), and mean relative error (MRE).
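The three agreement metrics named above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code. The abstract does not state which ICC form was used or how MRE was normalized, so two assumptions are made here: a two-way random-effects, absolute-agreement, single-measurement ICC (often written ICC(2,1)), and an MRE expressed as a percentage of the clinician's value. The function names and the sample thickness values are hypothetical.

```python
from statistics import mean


def mae(reference, measured):
    """Mean absolute error, in the same unit as the inputs (e.g. mm)."""
    return mean(abs(m - r) for r, m in zip(reference, measured))


def mre_percent(reference, measured):
    """Mean relative error in percent, relative to the reference values (assumed definition)."""
    return 100.0 * mean(abs(m - r) / r for r, m in zip(reference, measured))


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `ratings` is a list of rows, one row per subject and one column per
    rater or measurement session. The choice of this ICC form is an assumption.
    """
    n = len(ratings)      # number of subjects
    k = len(ratings[0])   # number of raters / sessions
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((rm - grand) ** 2 for rm in row_means)
    ss_cols = n * sum((cm - grand) ** 2 for cm in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


if __name__ == "__main__":
    # Made-up thickness values in mm, for illustration only (not study data)
    clinician = [4.0, 5.0, 3.0]
    model = [6.0, 5.5, 6.0]
    print("MAE (mm):", mae(clinician, model))
    print("MRE (%):", mre_percent(clinician, model))
    print("ICC(2,1):", icc_2_1([list(pair) for pair in zip(clinician, model)]))
```

Because ICC(2,1) measures absolute agreement, a constant offset between two sets of measurements lowers it even when the two sets are perfectly correlated. This matters when one rater systematically overestimates.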
A total of 41 participants (32 females, 9 males) were included. The mean plantar fascia thickness measured by the clinician was 4.20 ± 0.80 mm under blinded conditions and 4.25 ± 0.92 mm under non-blinded conditions, whereas ChatGPT-4's measurements were 6.47 ± 1.30 mm and 6.46 ± 1.31 mm, respectively. Human evaluators demonstrated excellent agreement (ICC = 0.983-0.989), whereas ChatGPT-4 exhibited low reliability (ICC = 0.391-0.432). ChatGPT-4's errors were larger in cases with a thin plantar fascia: MAE = 2.70 mm and MRE = 77.17% under blinded conditions, and MAE = 2.91 mm and MRE = 87.02% under non-blinded conditions.
ChatGPT-4 measured plantar fascia thickness less reliably than an experienced clinician, and its errors were larger in thin structures. These findings highlight the limitations of AI-based models in medical image analysis and emphasize the need for further refinement before clinical implementation.