Division of Chronic Disease Epidemiology, Epidemiology Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland.
Cancer Registry Zurich, Zug, Schaffhausen and Schwyz, University Hospital Zurich, Zurich, Switzerland.
Int J Cancer. 2023 Sep 1;153(5):932-941. doi: 10.1002/ijc.34568. Epub 2023 May 27.
Breast cancer survivors often experience recurrence or a second primary cancer. We developed an automated approach to predict the occurrence of any second breast cancer (SBC) using patient-level data and explored the generalizability of the models with an external validation data source. Data from breast cancer patients diagnosed between 2010 and 2018 and recorded in the cancer registry of Zurich, Zug, Schaffhausen and Schwyz (N = 3213; training dataset) and the cancer registry of Ticino (N = 1073; external validation dataset) were used for model training and validation, respectively. Machine learning (ML) methods, namely a feed-forward neural network (ANN), logistic regression, and extreme gradient boosting (XGB), were employed for classification. The best-performing model was selected based on the receiver operating characteristic (ROC) curve. Key characteristics contributing to a high SBC risk were identified. SBC was diagnosed in 6% of all cases. The most important features for SBC prediction were age at incidence, year of birth, stage, and extent of the pathological primary tumor. The ANN model had the highest area under the ROC curve, with 0.78 (95% confidence interval [CI] 0.75-0.82) in the training data and 0.70 (95% CI 0.61-0.79) in the external validation data. Comparing the generalizability of the different ML algorithms, we found that the ANN generalized better than the other models on the external validation data. This research is a first step towards the development of an automated tool that could assist clinicians in identifying women at high risk of developing an SBC and in potentially preventing it.
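As an illustration of the metric reported above, the following minimal sketch computes an area under the ROC curve and a 95% percentile-bootstrap confidence interval. The abstract does not state how the intervals were obtained, so the bootstrap is an assumption made here for illustration only. The data are synthetic (300 hypothetical cases with a 6% event rate), and the function names roc_auc and bootstrap_auc_ci are hypothetical and not taken from the study.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (an assumed method, not
    necessarily the one used in the study)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that contain only one class: AUC is undefined there.
        if 0 < sum(ys) < n:
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic example: 18 events among 300 cases (6%), with scores that are
# shifted upward for events. These numbers are made up for illustration.
rng = random.Random(42)
labels = [1] * 18 + [0] * 282
scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

auc = roc_auc(labels, scores)
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
print(f"AUC = {auc:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

With a rare outcome such as SBC, each bootstrap resample contains few events, which is one reason an interval estimated on a smaller external dataset tends to be wider than one estimated on the training data.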