Woo MinJae, Zhang Linglin, Brown-Mulry Beatrice, Hwang InChan, Gichoya Judy Wawira, Gastounioti Aimilia, Banerjee Imon, Seyyed-Kalantari Laleh, Trivedi Hari
Department of Public Health Sciences, Clemson University, Clemson, South Carolina, United States of America.
Artificial Intelligence Ethics Laboratory, Equifax Inc., Alpharetta, Georgia, United States of America.
PLOS Digit Health. 2025 Apr 8;4(4):e0000811. doi: 10.1371/journal.pdig.0000811. eCollection 2025 Apr.
This study evaluates a deep learning model for classifying normal versus potentially abnormal regions of interest (ROIs) on mammography, aiming to identify imaging, pathologic, and demographic characteristics that may lead to suboptimal model performance in certain patient subgroups. We utilized the EMory BrEast imaging Dataset (EMBED), containing 3.4 million mammographic images from 115,931 patients. Full-field digital mammograms from women aged 18 years or older were used to create positive and negative patches, matched on size, location, patient demographics, and imaging features. Several convolutional neural network (CNN) architectures were tested, with ResNet152V2 demonstrating the best performance. The dataset was split into training (29,144 patches), validation (9,910 patches), and testing (13,390 patches) sets. Performance metrics included accuracy, AUC, recall, precision, F1 score, false negative rate, and false positive rate. Subgroup analysis was conducted using univariate and multivariate regression models to control for confounding effects. The classification model achieved an AUC of 0.975 and a recall of 0.927. False negative predictions were significantly associated with White patients (RR = 1.208; p = 0.050), those never biopsied (RR = 1.079; p = 0.011), and cases with architectural distortion (RR = 1.037; p < 0.001). Higher breast density significantly increased the risk of false positives (BI-RADS density C: RR = 1.891, p < 0.001; density D: RR = 2.486, p < 0.001). Race and age were not significant predictors of false positives in multivariate analysis. These findings suggest that deep learning models for mammography may underperform in specific subgroups. The study underscores the need for more precise patient subgroup analysis and emphasizes the importance of accounting for confounding factors in deep learning model evaluations. These insights can help develop fair and interpretable decision-making models in mammography, ultimately enhancing the performance and equity of computer-aided detection (CADe) and computer-aided diagnosis (CADx) applications.
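The abstract names a Keras ResNet152V2 backbone and the standard binary-classification metrics but does not include code. The following is a minimal sketch, not the authors' implementation, of how such a patch classifier might be assembled; the input size, classification head, dropout, and learning rate are assumptions, and EMBED-specific patch extraction and matching are not shown.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_patch_classifier(input_shape=(224, 224, 3)):
    # ImageNet-pretrained ResNet152V2 backbone with global average pooling;
    # the single-unit sigmoid head scores "normal" vs. "potentially abnormal".
    base = tf.keras.applications.ResNet152V2(
        include_top=False, weights="imagenet", input_shape=input_shape, pooling="avg"
    )
    x = layers.Dropout(0.5)(base.output)          # assumed regularization choice
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inputs=base.input, outputs=out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),  # assumed learning rate
        loss="binary_crossentropy",
        metrics=[
            tf.keras.metrics.BinaryAccuracy(name="accuracy"),
            tf.keras.metrics.AUC(name="auc"),
            tf.keras.metrics.Recall(name="recall"),
            tf.keras.metrics.Precision(name="precision"),
        ],
    )
    return model

model = build_patch_classifier()
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # train/val/test splits as described in the abstract
```

The subgroup analysis reports adjusted risk ratios (RR) for false negatives and false positives. One common way to obtain covariate-adjusted RRs, sketched below under assumed column names ("false_negative", "race", "density", etc.), is a modified Poisson regression with robust standard errors; the paper's exact regression specification may differ.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-patch test-set predictions joined with patient and imaging metadata.
df = pd.read_csv("test_set_predictions.csv")

# Modified Poisson model (log link, HC1 robust errors) yields risk ratios
# adjusted for the covariates considered in the multivariate analysis.
fit = smf.glm(
    "false_negative ~ C(race) + C(density) + C(biopsy_history) + C(finding_type) + age_group",
    data=df,
    family=sm.families.Poisson(),
).fit(cov_type="HC1")

# Exponentiated coefficients are the adjusted risk ratios reported alongside p-values.
print(pd.DataFrame({"RR": np.exp(fit.params), "p": fit.pvalues}))
```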