Prucker Philipp, Busch Felix, Dorfner Felix, Mertens Christian J, Bayerl Nadine, Makowski Marcus R, Bressem Keno K, Adams Lisa C
Department of Diagnostic and Interventional Radiology, Technical University Munich, Ismaninger Str. 22, 81675 Munich, Germany.
Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Radiology, Charitéplatz 1, 10117 Berlin, Germany.
Clin Imaging. 2025 Sep;125:110557. doi: 10.1016/j.clinimag.2025.110557. Epub 2025 Jul 5.
Large Language Models (LLMs) show promise for generating patient-friendly radiology reports, but the performance of open-source versus proprietary LLMs requires assessment. This study compared open-source and proprietary LLMs in generating patient-friendly radiology reports from chest CT scans, using quantitative readability metrics and qualitative assessments by radiologists.
Fifty chest CT reports were processed by seven LLMs: three open-source models (Llama-3-70b, Mistral-7b, Mixtral-8x7b) and four proprietary models (GPT-4, GPT-3.5-Turbo, Claude-3-Opus, Gemini-Ultra). Simplification was evaluated using five quantitative readability metrics. Three radiologists rated patient-friendliness on a five-point Likert scale across five criteria. Content and coherence errors were counted. Inter-rater reliability and differences among models were statistically assessed.
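The abstract does not spell out which five readability metrics were used, but the results below reference the CLI (Coleman-Liau Index), a standard formula estimating the U.S. grade level needed to understand a text. As an illustration only, a minimal sketch of that one metric might look like this (the tokenization heuristics are assumptions, not the authors' implementation):

```python
import re

def coleman_liau_index(text: str) -> float:
    """Coleman-Liau Index: 0.0588 * L - 0.296 * S - 15.8,
    where L = letters per 100 words and S = sentences per 100 words.
    Word/sentence splitting here is a simple regex heuristic."""
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    # Count sentence-ending punctuation runs; assume at least one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = letters / len(words) * 100
    S = sentences / len(words) * 100
    return 0.0588 * L - 0.296 * S - 15.8
```

Lower scores indicate more accessible text, which is why a simplified report should score lower than the original clinical wording.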
Inter-rater reliability was substantial to almost perfect (κ = 0.76-0.86). Qualitatively, Llama-3-70b was non-inferior to leading proprietary models in 4/5 categories. GPT-3.5-Turbo showed the best overall readability, outperforming GPT-4 in two metrics. Llama-3-70b outperformed GPT-3.5-Turbo on the Coleman-Liau Index (p = 0.006). Claude-3-Opus and Gemini-Ultra scored lower on readability but were rated highly in qualitative assessments. Claude-3-Opus maintained perfect factual accuracy. Claude-3-Opus and GPT-4 outperformed Llama-3-70b in emotional sensitivity (90.0% vs. 46.0%, p < 0.001).
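The abstract reports agreement among the three radiologists as κ values but does not state which kappa variant was used. As one common way such agreement is computed, here is a minimal sketch of Cohen's kappa for a pair of raters (the rating data and function are illustrative, not from the study):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected by chance from each
    rater's marginal label frequencies."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed proportion of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from the product of marginal label proportions.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(counts_a[k] * counts_b[k]
              for k in set(counts_a) | set(counts_b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

With three raters, studies typically report either averaged pairwise kappas or Fleiss' kappa; the κ = 0.76-0.86 range falls in the substantial-to-almost-perfect band of the Landis-Koch scale.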
Llama-3-70b shows strong potential for generating high-quality, patient-friendly radiology reports, challenging proprietary models. With further adaptation, open-source LLMs could advance patient-friendly reporting technology.