Jiang Xia, Zhou Yijun, Wells Alan, Brufsky Adam
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA.
Department of Pathology, University of Pittsburgh and Pittsburgh VA Health System, Pittsburgh, PA 15261, USA.
Cancers (Basel). 2025 Jul 30;17(15):2515. doi: 10.3390/cancers17152515.
Background: Unlike most cancers, breast cancer poses a persistent risk of distant recurrence, often years after initial treatment, which makes long-term risk stratification uniquely challenging. Current tools fall short in predicting late metastatic events, particularly for early-stage patients.
Methods: We present an interpretable machine learning (ML) pipeline to predict distant recurrence-free survival at 5, 10, and 15 years, integrating Bayesian network-based causal feature selection, deep feed-forward neural network models (DNMs), and SHAP-based interpretation. Using electronic health record (EHR)-based clinical data from over 6000 patients, we first applied the Markov blanket and interactive risk factor learner (MBIL) to identify minimally sufficient predictor subsets. These subsets were then used to train optimized DNM classifiers, with hyperparameters tuned via grid search. The resulting models were benchmarked against models from 10 traditional ML methods and against models trained using all predictors.
Results: Our best models achieved area under the curve (AUC) scores of 0.79, 0.83, and 0.89 for 5-, 10-, and 15-year predictions, respectively, substantially outperforming baselines. MBIL reduced input dimensionality by over 80% without sacrificing accuracy. Importantly, MBIL-selected features (e.g., nodal status, hormone receptor expression, tumor size) overlapped strongly with top SHAP contributors, reinforcing interpretability. Calibration plots further demonstrated close agreement between predicted probabilities and observed recurrence rates. The percentage performance improvement due to grid search ranged from 25.3% to 60%.
Conclusions: This study demonstrates that combining causal selection, deep learning, and grid search improves prediction accuracy, transparency, and calibration for long-horizon breast cancer recurrence risk. The proposed framework is well-positioned for clinical use, especially to guide long-term follow-up and therapy decisions in early-stage patients.
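The two evaluation measures reported in the abstract, AUC and calibration, can be illustrated with a minimal sketch. This is not the authors' code: the function names `auc_score` and `calibration_bins`, the toy labels, and the toy predicted risks are all hypothetical, and coding recurrence as label 1 is an assumption made only for this example.

```python
from typing import List, Tuple


def auc_score(y_true: List[int], y_prob: List[float]) -> float:
    """AUC as the chance that a random positive is ranked above a random
    negative, with ties counted as one half (pairwise definition)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def calibration_bins(
    y_true: List[int], y_prob: List[float], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width probability bins and return
    (mean predicted probability, observed event rate, count) per non-empty bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((t, p))
    summary = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            rate = sum(t for t, _ in members) / len(members)
            summary.append((mean_p, rate, len(members)))
    return summary


# Toy data invented for illustration (assumption: 1 = distant recurrence).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
risks = [0.10, 0.30, 0.35, 0.40, 0.60, 0.80, 0.20, 0.90]

toy_auc = auc_score(labels, risks)
toy_bins = calibration_bins(labels, risks, n_bins=2)

print(f"AUC = {toy_auc:.4f}")
for mean_p, rate, count in toy_bins:
    print(f"mean predicted = {mean_p:.2f}, observed rate = {rate:.2f}, n = {count}")
```

A well-calibrated model shows a mean predicted probability close to the observed rate in every bin, which is what a calibration plot displays graphically. The pairwise AUC loop is quadratic in the number of patients, so a rank-based implementation would be preferable for a cohort of several thousand.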