From the Department of Radiology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China (Lu Zhang, L.W., Y.Z., Y.F., J.Z., Lin Zhang, G.Y., X. Xie); Winning Health Technology, Shanghai, China (M.L., X. Xu, Z.P., X.C.); and Department of Radiology, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Yan Chang Zhong Rd 301, Shanghai 200040, China (X. Xie).
Radiology. 2024 Sep;312(3):e240885. doi: 10.1148/radiol.240885.
Background The specialization and complexity of radiology make the automatic generation of radiologic impressions (ie, a diagnosis with differential diagnosis and management recommendations) challenging.
Purpose To develop a large language model (LLM) that generates impressions based on imaging findings and to evaluate its performance in professional and linguistic dimensions.
Materials and Methods Six radiologists recorded imaging examination findings from August 2 to 31, 2023, at Shanghai General Hospital and used the developed LLM before routinely writing report impressions for multiple radiologic modalities (CT, MRI, radiography, mammography) and anatomic sites (cranium and face, neck, chest, upper abdomen, lower abdomen, vessels, bone and joint, spine, breast), making necessary corrections and completing the radiologic impression. A subset was defined to investigate cases in which the LLM-generated impressions differed from the final radiologist impressions by excluding identical and highly similar cases. An expert panel scored the LLM-generated impressions on a five-point Likert scale (5 = strongly agree) for scientific terminology, coherence, specific diagnosis, differential diagnosis, management recommendations, correctness, comprehensiveness, harmlessness, and lack of bias.
Results In this retrospective study, an LLM was pretrained using 20 GB of medical and general-purpose text data. The fine-tuning data set comprised 1.5 GB of data, including 800 radiology reports with paired instructions (describing the output task in natural language) and outputs. Test set 2 included data from 3988 patients (median age, 56 years [IQR, 40-68 years]; 2159 male). With the final impressions as the reference standard, the median recall, precision, and F1 score of LLM-generated impressions were 0.775 (IQR, 0.56-1), 0.84 (IQR, 0.611-1), and 0.772 (IQR, 0.578-0.957), respectively. In a subset of 1014 patients (median age, 57 years [IQR, 42-69 years]; 528 male), the overall median expert panel score for LLM-generated impressions was 5 (IQR, 5-5), with individual dimensions ranging from 4 (IQR, 3-5) to 5 (IQR, 5-5).
Conclusion The developed LLM generated radiologic impressions that were professionally and linguistically appropriate across the full spectrum of radiology examinations. © RSNA, 2024
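As a hedged illustration of how recall, precision, and F1 scores like those reported above could be computed, the sketch below assumes a simple bag-of-words overlap between each LLM-generated impression and the corresponding final radiologist impression; the abstract does not specify the study's exact tokenization or matching scheme, so the function, the overlap definition, and the example report pairs are all hypothetical.

```python
# Illustrative sketch only: assumes multiset token overlap between the
# LLM-generated impression and the radiologist's final (reference) impression.
from collections import Counter
from statistics import median


def overlap_scores(generated: str, reference: str) -> tuple[float, float, float]:
    """Return (recall, precision, F1) based on bag-of-words overlap."""
    gen_tokens = Counter(generated.lower().split())
    ref_tokens = Counter(reference.lower().split())
    overlap = sum((gen_tokens & ref_tokens).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0, 0.0, 0.0
    recall = overlap / sum(ref_tokens.values())      # fraction of reference tokens recovered
    precision = overlap / sum(gen_tokens.values())   # fraction of generated tokens that match
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Hypothetical usage: (LLM-generated impression, final radiologist impression) pairs.
pairs = [
    ("no acute intracranial abnormality", "no acute intracranial abnormality"),
    ("right lower lobe nodule follow up in 6 months",
     "right lower lobe nodule recommend 6 month follow up ct"),
]
recalls, precisions, f1s = zip(*(overlap_scores(g, r) for g, r in pairs))
print(median(recalls), median(precisions), median(f1s))
```

Reporting the median (with IQR) of these per-case scores, as the study does, is robust to the many cases in which the generated and final impressions are identical (score of 1) or nearly so.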