Cai Lie, Golatta Michael, Sidey-Gibbons Chris, Barr Richard G, Pfob André
Department of Obstetrics and Gynecology, Breast Cancer Center, Heidelberg University Hospital, Im Neuenheimer Feld 440, 69120, Heidelberg, Germany.
Breast Centre Heidelberg, Klinik St. Elisabeth, Heidelberg, Germany.
Arch Gynecol Obstet. 2025 Jan 30. doi: 10.1007/s00404-024-07901-8.
Artificial intelligence models based on medical (imaging) data are increasingly being developed. However, the imaging software on which the original data are generated is frequently updated, and the impact of such software updates on the performance of AI models is unclear. We aimed to develop machine learning models using shear wave elastography (SWE) data to identify malignant breast lesions and to test the models' generalizability by validating them on external data generated by both the original and the updated software versions.
We developed and validated different machine learning models (GLM, MARS, XGBoost, SVM) on multicenter, international SWE data (NCT02638935) using tenfold cross-validation. Findings were compared with the histopathologic evaluation of the biopsy specimen or 2-year follow-up. The outcome measure was the area under the receiver operating characteristic curve (AUROC).
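A minimal sketch of this kind of development pipeline is shown below: tenfold cross-validated AUROC for a GLM (logistic regression) and an XGBoost classifier on tabular SWE features. This is an illustration only, not the authors' code; the file name "swe_features.csv", the label column "malignant", and all hyperparameters are hypothetical assumptions.

```python
# Sketch (assumed setup, not the authors' implementation): ten-fold
# cross-validated AUROC for two of the model families named in the abstract.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

df = pd.read_csv("swe_features.csv")        # hypothetical: one row per lesion
X = df.drop(columns=["malignant"])          # SWE and clinical features
y = df["malignant"]                         # 1 = malignant (histopathology / 2-year follow-up)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "GLM": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "XGBoost": XGBClassifier(n_estimators=300, max_depth=3,
                             learning_rate=0.05, eval_metric="logloss"),
}

for name, model in models.items():
    aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUROC = {aucs.mean():.3f} (SD {aucs.std():.3f})")
```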
We included 1288 cases in the development set, generated with the original imaging software, and 385 cases in the validation set, generated with both the original and the updated software. In the external validation set, the GLM and XGBoost models performed better on the updated software data than on the original software data (AUROC 0.941 vs. 0.902, p < 0.001, and 0.934 vs. 0.872, p < 0.001, respectively). The MARS model performed worse on the updated software data (0.847 vs. 0.894, p = 0.045). The SVM model was not calibrated.
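The abstract does not specify the statistical test behind these p-values. As an illustration only, a paired bootstrap over the 385 external-validation cases is one common way to compare AUROC between the original- and updated-software measurements of the same lesions; the array names below are hypothetical.

```python
# Sketch (assumption, not the authors' analysis): paired bootstrap comparison of
# external-validation AUROC between updated- and original-software predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def auc_diff_bootstrap(y, p_updated, p_original, n_boot=2000):
    """Return the observed AUROC difference and a two-sided bootstrap p-value."""
    y, p_updated, p_original = map(np.asarray, (y, p_updated, p_original))
    observed = roc_auc_score(y, p_updated) - roc_auc_score(y, p_original)
    diffs = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # resample cases with replacement
        if len(np.unique(y[idx])) < 2:         # skip resamples with only one class
            continue
        diffs.append(roc_auc_score(y[idx], p_updated[idx])
                     - roc_auc_score(y[idx], p_original[idx]))
    diffs = np.asarray(diffs)
    # two-sided p-value: how often the centered bootstrap difference
    # is at least as extreme as the observed difference
    p_value = np.mean(np.abs(diffs - diffs.mean()) >= abs(observed))
    return observed, p_value
```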
In this multicenter study using SWE data, some machine learning models showed great potential to bridge the gap between the original and the updated software, whereas others generalized poorly.