Sing David C, Shah Kishan S, Pompliano Michael, Yi Paul H, Velluto Calogero, Bagheri Ali, Eastlack Robert K, Stephan Stephen R, Mundis Gregory M
Division of Spine Surgery, Department of Orthopaedic Surgery, Scripps Clinic, 10710 N Torrey Pines Rd, La Jolla, CA, 92037, United States, 1 8585547988.
Department of Radiology, St. Jude Children's Research Hospital, Memphis, TN, United States.
JMIR AI. 2025 Jul 1;4:e69654. doi: 10.2196/69654.
Magnetic resonance imaging (MRI) reports are challenging for patients to interpret and may subject patients to unnecessary anxiety. The advent of advanced artificial intelligence (AI) large language models (LLMs), such as GPT-4o, holds promise for translating complex medical information into layman's terms.
This paper aims to evaluate the accuracy, helpfulness, and readability of GPT-4o in explaining MRI reports of patients with thoracolumbar fractures.
MRI reports of 20 patients presenting with thoracic or lumbar vertebral body fractures were obtained. GPT-4o was prompted to explain each MRI report in layman's terms. The generated explanations were then presented to 7 board-certified spine surgeons for evaluation of their helpfulness and accuracy. The MRI report text and GPT-4o explanations were then analyzed to grade readability using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL) scale.
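The two readability metrics used in this study are simple closed-form formulas over sentence, word, and syllable counts. As a minimal sketch of how such grading can be computed, the snippet below implements both formulas in Python; the syllable counter is a naive vowel-group heuristic of our own (validated tools such as those used in readability research rely on pronunciation dictionaries), so exact scores will differ from the study's instrument.

```python
import re


def count_syllables(word: str) -> int:
    """Naive heuristic syllable count: vowel groups, minus a trailing silent 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # treat final 'e' as silent
    return max(n, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a passage of English prose."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    # Flesch Reading Ease: higher = easier (standard published coefficients)
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    # Flesch-Kincaid Grade Level: approximate US school grade
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl
```

For example, a short simple sentence scores very high on FRES and near grade 0 on FKGL, while dense radiology prose with long, polysyllabic sentences scores low on FRES and at a high grade level, which is the contrast the study quantifies.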
The layman explanations provided by GPT-4o were found to be helpful by all surgeons in 17 cases, with 6 of 7 surgeons finding the information helpful in the remaining 3 cases. ChatGPT-generated layman reports were rated as "accurate" by all 7 surgeons in 11/20 cases (55%). In an additional 5/20 cases (25%), 6 of 7 surgeons agreed on their accuracy. In the remaining 4/20 cases (20%), accuracy ratings varied, with 4 or 5 surgeons considering them accurate. Review of surgeon feedback on inaccuracies revealed that the radiology reports themselves were often insufficiently detailed. The mean FRES of the MRI reports was significantly lower than that of the GPT-4o explanations (32.15, SD 15.89 vs 53.9, SD 7.86; P<.001). The mean FKGL of the MRI reports trended higher than that of the GPT-4o explanations (11th-12th grade vs 10th-11th grade level; P=.11).
Overall helpfulness and readability ratings for AI-generated summaries of MRI reports were high, with few inaccuracies recorded. This study demonstrates the potential of GPT-4o to serve as a valuable tool for enhancing patient comprehension of MRI report findings.