Hickman Alistair J, Gomes Sandra, Warren Lucy M, Smith Nadia A S, Shenton-Taylor Caroline
Department of Physics, University of Surrey, Guildford, United Kingdom.
Department of Scientific Computing and National Co-ordinating Centre for the Physics of Mammography, Royal Surrey NHS Foundation Trust, Guildford, United Kingdom.
PLOS Digit Health. 2025 Aug 12;4(8):e0000973. doi: 10.1371/journal.pdig.0000973. eCollection 2025 Aug.
The aim of this study was to determine whether differences between manufacturers of mammogram images affect the performance of artificial intelligence tools for classifying breast density. Processed mammograms from 10,156 women were used to train and validate three deep learning algorithms using three retrospective datasets: Hologic, General Electric, and Mixed (equal numbers of Hologic, General Electric and Siemens images). The models were tested on four independent withheld test sets (Hologic, General Electric, Mixed and Siemens), and the area under the receiver operating characteristic curve (AUC) was compared. Women aged 47-73 with normal breasts (routine recall - no cancer) and Volpara ground truth were selected from the OPTIMAM Mammography Image Database for the years 2012-2015. 95% confidence intervals were used for significance testing in the results, with a Bayesian Signed Rank test used to rank the overall performance of the models. The best single-test performance was seen when a model was trained and tested on images from a single manufacturer (Hologic train/test: 0.98 and General Electric train/test: 0.97); however, the same models performed significantly worse on images from any other manufacturer (General Electric AUCs: 0.68 & 0.63; Hologic AUCs: 0.56 & 0.90). The model trained on the mixed dataset exhibited the best overall performance. Performance is better when training and test sets contain the same manufacturer distributions, and generalisation is better when more manufacturers are included in training. Models in clinical use should be trained on data representing the different vendors of mammogram machines used across screening programs. This is clinically relevant as models will be impacted by changes and upgrades to mammogram machines in screening centres.
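The comparison described above rests on AUC values and their 95% confidence intervals. The following is a minimal sketch of how such a comparison could be computed. It is not the authors' code. It assumes a binary simplification of the density task, and it assumes a percentile bootstrap for the interval (the abstract does not state how the intervals were obtained). The function names `auc` and `bootstrap_auc_ci` and the toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the fraction of (positive, negative)
    pairs in which the positive case scores higher; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that contain only one class; AUC is undefined there.
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical toy scores for one model on one test set.
    y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7, 0.6, 0.5]
    print("AUC:", auc(y_true, y_score))
    print("95% CI:", bootstrap_auc_ci(y_true, y_score))
```

Under this sketch, two train/test combinations would be judged to differ when their intervals do not overlap, which is one common reading of "95% confidence intervals used for significance testing"; the study's exact criterion may differ.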