Kwon Mi-Ri, Chang Yoosoo, Ham Soo-Youn, Cho Yoosun, Kim Eun Young, Kang Jeonggyu, Park Eun Kyung, Kim Ki Hwan, Kim Minjeong, Kim Tae Soo, Lee Hyeonsoo, Kwon Ria, Lim Ga-Young, Choi Hye Rin, Choi JunHyeok, Kook Shin Ho, Ryu Seungho
Department of Radiology, Kangbuk Samsung Hospital, Sungkyunkwan University School of Medicine, Seoul, South Korea.
Center for Cohort Studies, Kangbuk Samsung Hospital, Sungkyunkwan University School of Medicine, Samsung Main Building B2, 250, Taepyung-ro 2ga, Jung-gu, 04514, Seoul, South Korea.
Breast Cancer Res. 2024 Apr 22;26(1):68. doi: 10.1186/s13058-024-01821-w.
Artificial intelligence (AI) algorithms for the independent assessment of screening mammograms have not been well established in a large screening cohort of Asian women. We compared the performance of screening digital mammography between radiologists and standalone AI detection among Korean women, taking breast density into account.
We retrospectively included 89,855 Korean women who underwent their initial screening digital mammography from 2009 to 2020. Breast cancer diagnosed within 12 months of the screening mammography, according to the National Cancer Registry, served as the reference standard. Lunit software was used to determine the probability of malignancy scores, with a cutoff of 10% for breast cancer detection. The AI's performance was compared with the final Breast Imaging Reporting and Data System category recorded by breast radiologists. Breast density was classified into four categories (A-D) based on radiologist and AI-based assessments. Performance metrics (cancer detection rate [CDR], sensitivity, specificity, positive predictive value [PPV], recall rate, and area under the receiver operating characteristic curve [AUC]) were compared across breast density categories.
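The screening metrics above all derive from the same two-by-two confusion table once exams are dichotomized at the malignancy-score cutoff. A minimal sketch of how such metrics could be computed is shown below; the function name, the toy scores, and the labels are illustrative assumptions, not study data or the authors' actual analysis code.

```python
import numpy as np

def screening_metrics(scores, labels, cutoff=0.10):
    """Compute screening performance metrics at a score threshold.

    scores: per-exam probability-of-malignancy scores in [0, 1]
    labels: 1 if breast cancer within 12 months (reference standard), else 0
    cutoff: recall threshold (the study used 10% for the AI software)
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    recalled = scores >= cutoff  # exams flagged positive at this cutoff

    tp = int(np.sum(recalled & labels))    # screen-detected cancers
    fp = int(np.sum(recalled & ~labels))   # false recalls
    fn = int(np.sum(~recalled & labels))   # missed (interval-detected) cancers
    tn = int(np.sum(~recalled & ~labels))
    n = len(scores)

    return {
        "cdr_per_1000": 1000.0 * tp / n,       # cancer detection rate
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "recall_rate": (tp + fp) / n,          # lower is better for screening
    }

# Toy usage with made-up values (not study data):
m = screening_metrics([0.02, 0.95, 0.40, 0.01, 0.08], [0, 1, 0, 0, 1])
```

Note that in screening terminology the "recall rate" is the fraction of all exams flagged for further workup, so a lower value is favorable, unlike sensitivity or specificity.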
Mean participant age was 43.5 ± 8.7 years; 143 breast cancer cases were identified within 12 months. The CDRs (1.1 per 1000 examinations) and sensitivity values showed no significant differences between radiologist and AI-based results (69.9% [95% confidence interval [CI], 61.7-77.3] vs. 67.1% [95% CI, 58.8-74.8]). However, the AI algorithm showed better specificity (93.0% [95% CI, 92.9-93.2] vs. 77.6% [95% CI, 61.7-77.9]), PPV (1.5% [95% CI, 1.2-1.9] vs. 0.5% [95% CI, 0.4-0.6]), recall rate (7.1% [95% CI, 6.9-7.2] vs. 22.5% [95% CI, 22.2-22.7]), and AUC (0.8 [95% CI, 0.76-0.84] vs. 0.74 [95% CI, 0.7-0.78]) (all P < 0.05). Both radiologist and AI-based results performed best in the non-dense category; the CDR and sensitivity were higher for radiologists in the heterogeneously dense category (P = 0.059). However, specificity, PPV, and recall rate consistently favored the AI-based results across all categories, including the extremely dense category.
The AI-based software showed slightly lower sensitivity, although the difference was not statistically significant. However, it outperformed radiologists in recall rate, specificity, PPV, and AUC, with the disparities most prominent in extremely dense breast tissue.