Explainable machine learning model for predicting internal mammary node metastasis in breast cancer: Multi-method development and cross-cohort validation.

Author information

Xiang Yirong, Tie Jian, Zhang Siyuan, Shi Chen, Guo Changkuo, Peng Yushuo, Fan Zhaoqing, Wang Weihu

Affiliations

Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Radiation Oncology, Peking University Cancer Hospital and Institute, China.

Breast Center, Peking University Cancer Hospital and Institute, China.

Publication information

Breast. 2025 Jun 9;82:104517. doi: 10.1016/j.breast.2025.104517.

DOI:10.1016/j.breast.2025.104517

PMID:40516245

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12205679/

Abstract

BACKGROUND

This study developed an explainable machine learning model to predict baseline internal mammary lymph node metastasis (IMNM) in breast cancer patients.

MATERIALS AND METHODS

This study included three cohorts: a derivation cohort (n = 1997) from Peking University Cancer Hospital, a temporal testing cohort (n = 633) from the same center, and a Surveillance, Epidemiology, and End Results (SEER) cohort (n = 51,420). Four feature-selection methods were applied: Least Absolute Shrinkage and Selection Operator (LASSO), Boruta, backward stepwise regression, and best subset selection. Four algorithms were used for model construction: logistic regression (LR), support vector machines (SVM), k-nearest neighbors (KNN), and extreme gradient boosting (XGBoost). The best-performing model was validated in the internal and temporal testing cohorts. Shapley Additive Explanations (SHAP) analysis was conducted to improve interpretability.
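As an illustration of the idea behind SHAP, the sketch below computes exact Shapley values for a single prediction by enumerating every feature coalition. It is not the study's pipeline: the abstract does not describe the SHAP configuration, and the three-feature logistic model, its weights, the patient vector, and the baseline vector are all hypothetical. Features outside a coalition are set to the baseline value, which is one common simplification.

```python
from itertools import combinations
from math import exp, factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features in the coalition take the patient's value; the rest
        # are held at the baseline (an assumption of this sketch).
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for combo in combinations(others, k):
                s = set(combo)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(z):
    """Hypothetical logistic risk model; weights are illustrative only."""
    weights = [0.9, 0.4, 0.3]
    intercept = -2.0
    linear = intercept + sum(w * v for w, v in zip(weights, z))
    return 1.0 / (1.0 + exp(-linear))


patient = [2, 3, 1]    # hypothetical encoded feature values
baseline = [0, 1, 0]   # hypothetical reference patient
phi = shapley_values(toy_risk, patient, baseline)
print([round(p, 4) for p in phi])
```

The contributions sum to the difference between the patient's predicted risk and the baseline risk, which is the additivity property that lets a SHAP plot decompose a single prediction feature by feature. Exact enumeration scales as 2^n, so it is only practical for a small number of features; SHAP libraries rely on approximations for larger models.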

RESULTS

Six clinical features (clinical N stage, size, stage, classification, grade and location) were used to construct the final predictive model with SVM. The model achieved robust performance, with AUCs of 0.811 (0.790-0.843), 0.806 (0.760-0.857) and 0.864 (0.830-0.926) in the training, internal testing and temporal testing cohorts, respectively. High-risk patients had significantly worse disease-free survival (DFS; HR 2.776, 95% CI: 1.897-4.064, p < 0.001) and overall survival (OS; HR 1.962, 95% CI: 1.853-2.077, p < 0.001). An online prediction tool was established that allows users to input key clinical variables and obtain model-predicted probabilities along with SHAP-based explanations.
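For readers unfamiliar with how an AUC and its accompanying interval can be obtained, here is a minimal sketch. The abstract does not state how the reported intervals were computed; a percentile bootstrap is assumed here purely for illustration, and the labels and scores are made-up toy data, not study data.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains one class only; AUC is undefined
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data: 1 = metastasis present, 0 = absent (illustrative only).
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.90]

point = auc(labels, scores)
lower, upper = bootstrap_ci(labels, scores)
print(f"AUC {point:.3f} ({lower:.3f}-{upper:.3f})")
```

With only ten toy cases the interval is very wide; cohorts of the size reported above give much narrower intervals.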

CONCLUSION

This validated and explainable machine learning model offers a practical tool for early risk stratification, aiding clinicians in appropriate baseline imaging selection and adjuvant treatment planning.
