Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore.
Division of Radiation Oncology, National Cancer Centre Singapore, Singapore Health Services, Singapore, Singapore.
JCO Clin Cancer Inform. 2024 Nov;8:e2400071. doi: 10.1200/CCI.24.00071. Epub 2024 Nov 22.
Neoadjuvant chemotherapy (NAC) is increasingly used in breast cancer. Predictive models can help estimate the likelihood of pathologic complete response (pCR) to NAC. We tested machine learning (ML) models to predict pCR in breast cancer and explored methods of handling missing data.
Four hundred and ninety-nine patients with breast cancer treated with NAC at two centers in Singapore (National Cancer Centre Singapore [NCCS] and KK Women's and Children's Hospital) between January 2014 and December 2017 were included. Eleven clinical features were used to train five different ML models. Listwise deletion and imputation were evaluated as methods for handling missing data. Model performance was evaluated by AUC and calibration (Brier score). Feature importance for the best-performing model in the external testing data set was calculated using Shapley additive explanations.
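The two missing-data strategies and the two evaluation metrics named above can be illustrated with a minimal pure-Python sketch. The toy records, the field names (`age`, `er_intensity`, `pcr`), and the use of mean imputation are assumptions made for illustration only; the abstract does not state which imputation method or software the study used.

```python
from statistics import mean

# Hypothetical toy records (not study data); None marks a missing field.
records = [
    {"age": 45.0, "er_intensity": 3.0, "pcr": 0},
    {"age": None, "er_intensity": 0.0, "pcr": 1},
    {"age": 60.0, "er_intensity": None, "pcr": 0},
    {"age": 38.0, "er_intensity": 0.0, "pcr": 1},
]
FEATURES = ["age", "er_intensity"]


def listwise_delete(rows, features):
    """Drop every row that has at least one missing feature."""
    return [r for r in rows if all(r[f] is not None for f in features)]


def mean_impute(rows, features):
    """Fill each missing feature with the mean of its observed values.

    Mean imputation is an assumed stand-in for whatever method the study used.
    """
    fill = {f: mean(r[f] for r in rows if r[f] is not None) for f in features}
    return [
        {**r, **{f: fill[f] if r[f] is None else r[f] for f in features}}
        for r in rows
    ]


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """AUC as the chance that a positive case outranks a negative one.

    Tied scores count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


complete_cases = listwise_delete(records, FEATURES)  # keeps fully observed rows only
imputed = mean_impute(records, FEATURES)             # keeps every row
```

Listwise deletion shrinks the usable sample, whereas imputation retains every patient at the cost of introducing estimated values; comparing models trained on each version is one way to see whether that trade-off matters.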
Seventy-two (24.6%), 18 (24.7%), and 31 (24.8%) patients attained pCR in the NCCS training, NCCS testing, and KK Women's and Children's Hospital (KKH) testing data sets, respectively. The random forest (RF) base and imputed models had the highest AUCs in the KKH cohort, at 0.794 (95% CI, 0.709 to 0.873) and 0.795 (95% CI, 0.706 to 0.871), respectively, and were the best calibrated, with the lowest Brier scores. No statistically significant difference was noted between the AUCs of the base and imputed models in any data set. The imputed model had a larger positive predictive value (PPV; 98.2% v 95.1%) and negative predictive value (NPV; 96.7% v 90.0%) than the base model in the KKH data set. Estrogen receptor intensity, human epidermal growth factor receptor 2 (HER2) intensity, and age at diagnosis were the three most important predictors.
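For readers less familiar with PPV and NPV, the following sketch shows how both are derived from thresholded predictions. The labels, predicted probabilities, and the 0.5 cutoff are hypothetical; the abstract does not report the classification threshold the study used.

```python
def ppv_npv(y_true, y_prob, threshold=0.5):
    """Return (PPV, NPV) after thresholding predicted probabilities.

    PPV = TP / (TP + FP): share of predicted responders who truly responded.
    NPV = TN / (TN + FN): share of predicted non-responders who truly did not.
    The default threshold of 0.5 is an assumption for illustration.
    """
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        if p >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y == 0:
                tn += 1
            else:
                fn += 1
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical outcomes (1 = pCR) and model probabilities, not study data.
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.1, 0.3]
ppv, npv = ppv_npv(y_true, y_prob)
```

Unlike AUC, both values depend on the chosen cutoff and on the prevalence of pCR in the cohort, which is why two models with nearly identical AUCs can still differ in PPV and NPV.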
ML, particularly RF, demonstrates reasonable accuracy in pCR prediction after NAC. Imputing missing fields in the data can improve the PPV and NPV of the pCR prediction model.