Hong Eun Kyoung, Ham Jiyeon, Roh Byungseok, Gu Jawook, Park Beomhee, Kang Sunghun, You Kihyun, Eom Jihwan, Bae Byeonguk, Jo Jae-Bock, Song Ok Kyu, Bae Woong, Lee Ro Woon, Suh Chong Hyun, Park Chan Ho, Choi Seong Jun, Park Jai Soung, Park Jae-Hyeong, Jeon Hyun Jeong, Hong Jeong-Ho, Cho Dosang, Choi Han Seok, Kim Tae Hee
Department of Radiology, Brigham & Women's Hospital, 75 Francis St, Boston, MA 02215.
Kakao, Seoul, South Korea.
Radiology. 2025 Mar;314(3):e241476. doi: 10.1148/radiol.241476.
Background Generative artificial intelligence (AI) is anticipated to alter radiology workflows, necessitating a clinical value assessment for frequent examinations such as chest radiograph interpretation.

Purpose To develop and evaluate the diagnostic accuracy and clinical value of a domain-specific multimodal generative AI model for providing preliminary interpretations of chest radiographs.

Materials and Methods For training, consecutive radiograph-report pairs from frontal chest radiography were retrospectively collected from 42 hospitals (2005-2023). The trained domain-specific AI model generated radiology reports for the radiographs. The test set included public datasets (PadChest, Open-i, VinDr-CXR, and MIMIC-CXR-JPG) and radiographs excluded from training. The sensitivity and specificity of the model-generated reports for 13 radiographic findings, compared with radiologist annotations (reference standard), were calculated (with 95% CIs). Four radiologists evaluated the subjective quality of the reports in terms of acceptability, agreement score, quality score, and comparative ranking of reports from the domain-specific AI model, radiologists, and a general-purpose large language model (GPT-4Vision). Acceptability was defined as whether the radiologist would endorse the report as their own without changes. Agreement scores from 1 (clinically significant discrepancy) to 5 (complete agreement) were assigned using RADPEER; quality scores were on a 5-point Likert scale from 1 (very poor) to 5 (excellent).

Results A total of 8 838 719 radiograph-report pairs (training) and 2145 radiographs (testing) were included (anonymized with respect to sex and gender). Reports generated by the domain-specific AI model demonstrated high sensitivity for detecting two critical radiographic findings: 95.3% (181 of 190) for pneumothorax and 92.6% (138 of 149) for subcutaneous emphysema. The acceptance rate, as evaluated by the four radiologists, was 70.5% (6047 of 8680), 73.3% (6288 of 8580), and 29.6% (2536 of 8580) for the model-generated, radiologist, and GPT-4Vision reports, respectively. Agreement scores were highest for the model-generated reports (median = 4 [IQR, 3-5]) and lowest for the GPT-4Vision reports (median = 1 [IQR, 1-3]; P < .001). Quality scores were also highest for the model-generated reports (median = 4 [IQR, 3-5]) and lowest for the GPT-4Vision reports (median = 2 [IQR, 1-3]; P < .001). In the ranking analysis, model-generated reports were most frequently ranked highest (60.0%; 5146 of 8580), and GPT-4Vision reports were most frequently ranked lowest (73.6%; 6312 of 8580).

Conclusion A domain-specific multimodal generative AI model demonstrated potential for high diagnostic accuracy and clinical value in providing preliminary interpretations of chest radiographs for radiologists. © RSNA, 2025 See also the editorial by Little in this issue.
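The sensitivity figures above follow directly from the reported counts, e.g. 181 of 190 pneumothorax cases detected. A minimal sketch of that arithmetic, using the Wilson score interval for the 95% CI (an assumption; the abstract does not state which interval method the authors used):

```python
import math

def sensitivity_with_ci(tp: int, fn: int, z: float = 1.96):
    """Sensitivity (TP / (TP + FN)) with a Wilson score 95% CI.

    tp: findings correctly detected by the model-generated reports
    fn: findings present per the reference standard but missed
    """
    n = tp + fn
    p = tp / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, center - half, center + half

# Pneumothorax: 181 of 190 cases detected (9 missed)
p, lo, hi = sensitivity_with_ci(181, 190 - 181)
print(f"sensitivity {p:.1%}, 95% CI {lo:.1%}-{hi:.1%}")
```

The point estimate reproduces the reported 95.3%; the CI bounds here illustrate the method only, since the abstract does not list the per-finding intervals.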