Xiang Yirong, Tie Jian, Zhang Siyuan, Shi Chen, Guo Changkuo, Peng Yushuo, Fan Zhaoqing, Wang Weihu
Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Radiation Oncology, Peking University Cancer Hospital and Institute, China.
Breast Center, Peking University Cancer Hospital and Institute, China.
Breast. 2025 Jun 9;82:104517. doi: 10.1016/j.breast.2025.104517.
This study developed an explainable machine learning model to predict baseline internal mammary lymph node metastasis (IMNM) in patients with breast cancer.
This study included three cohorts: a derivation cohort (n = 1997) from Peking University Cancer Hospital, a temporal testing cohort (n = 633) from the same center, and a Surveillance, Epidemiology, and End Results (SEER) cohort (n = 51,420). Several machine learning strategies were applied: Least Absolute Shrinkage and Selection Operator (LASSO), Boruta, backward stepwise regression, and best subset selection for feature selection; and logistic regression (LR), support vector machines (SVM), k-nearest neighbors (KNN), and extreme gradient boosting (XGBoost) for model construction. The best-performing model was validated in the internal and temporal testing cohorts. Shapley Additive Explanations (SHAP) analysis was conducted to improve interpretability.
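The SHAP analysis mentioned above rests on Shapley values, which split the gap between a patient's prediction and a reference prediction across the input features. The following is a minimal sketch of exact Shapley attribution by coalition enumeration. The `toy_risk` function, its coefficients, the three feature names, and the patient and baseline values are all hypothetical illustrations, not the study's SVM or data. Using a single baseline vector is a simplification: SHAP implementations typically average over a background dataset instead.

```python
from itertools import combinations
from math import exp, factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature coalitions."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Coalition features take the patient's value; the rest stay at baseline.
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_risk(features):
    """Hypothetical logistic risk score with one interaction term (not the paper's model)."""
    n_stage, size_cm, grade = features
    score = -3.0 + 0.9 * n_stage + 0.3 * size_cm + 0.4 * grade + 0.2 * n_stage * grade
    return 1.0 / (1.0 + exp(-score))


patient = [2, 3.5, 3]    # hypothetical patient: N stage, tumor size (cm), grade
baseline = [0, 1.5, 1]   # hypothetical reference patient
phi = shapley_values(toy_risk, patient, baseline)

for name, value in zip(["n_stage", "size_cm", "grade"], phi):
    print(f"{name}: {value:+.3f}")
print(f"sum of attributions: {sum(phi):.3f}")
print(f"prediction gap:      {toy_risk(patient) - toy_risk(baseline):.3f}")
```

The last two printed values should agree, which is the efficiency property of Shapley values: the attributions add up to the difference between the patient's predicted risk and the baseline's. Exact enumeration grows exponentially with the number of features, so it is practical only for small feature sets such as the six used here.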
Six clinical features (clinical N stage, size, stage, classification, grade and location) were used to construct the final predictive model with SVM. The model achieved robust performance, with AUCs of 0.811 (0.790-0.843), 0.806 (0.760-0.857) and 0.864 (0.830-0.926) in the training, internal testing and temporal testing cohorts, respectively. High-risk patients had significantly worse disease-free survival (DFS; HR 2.776, 95% CI: 1.897-4.064, p < 0.001) and overall survival (OS; HR 1.962, 95% CI: 1.853-2.077, p < 0.001). An online prediction tool was established that allows users to input key clinical variables and obtain model-predicted probabilities along with SHAP-based explanations.
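The AUCs above are reported with intervals in parentheses. The abstract does not say how those intervals were computed, so the following is only a sketch of one common approach, a percentile bootstrap, run on synthetic labels and scores. The function names `auc` and `bootstrap_auc_ci`, the sample size, and the resampling settings are assumptions made for illustration.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the two classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return auc(labels, scores), lower, upper


# Synthetic data: about 20% positives, whose scores are shifted upward by one unit.
data_rng = random.Random(42)
labels = [1 if data_rng.random() < 0.2 else 0 for _ in range(200)]
scores = [data_rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

point, lower, upper = bootstrap_auc_ci(labels, scores)
print(f"AUC {point:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

The pairwise definition of AUC used here is quadratic in the number of cases, which is acceptable for a small illustration. For cohorts the size of those in the study, a rank-based computation would be the more practical choice.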
This validated and explainable machine learning model offers a practical tool for early risk stratification, aiding clinicians in appropriate baseline imaging selection and adjuvant treatment planning.