Sun Jeffrey, Sun Cheuk-Kay, Tang Yun-Xuan, Liu Tzu-Chi, Lu Chi-Jie
Department of Acute Medicine, West Middlesex University Hospital, London TW7 6AF, UK.
School of Medicine, Imperial College London, London SW7 2BX, UK.
Healthcare (Basel). 2023 Jul 11;11(14):2000. doi: 10.3390/healthcare11142000.
Mammography is considered the gold standard for breast cancer screening. Multiple risk factors that affect breast cancer development have been identified; however, their relative significance remains debated. Machine learning (ML) models combined with the SHapley Additive exPlanations (SHAP) methodology can rank risk factors and make model results interpretable. This study used ML algorithms with SHAP to compare risk factors between two age groups and to evaluate the impact of each factor in predicting a positive mammography result. The ML models were built using data from the risk factor questionnaires of women who participated in a breast cancer screening program from 2017 to 2021. Three ML models were applied: least absolute shrinkage and selection operator (lasso) logistic regression, extreme gradient boosting (XGBoost), and random forest (RF). RF achieved the best performance. SHAP values were then computed for the RF model for further analysis. The model identified age at menarche, education level, parity, breast self-examination, and BMI as the top five risk factors affecting mammography outcomes. Between the age groups, the rankings differed most for reproductive lifespan, which ranked higher in the younger group, and BMI, which ranked higher in the older group. The SHAP framework makes it possible to understand the relationships between risk factors and to generate individualized risk factor rankings. This study provides avenues for further research and individualized medicine.
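To make the idea of an individualized risk factor ranking concrete, the following is a minimal sketch of exact Shapley value attribution for one hypothetical participant. It is not the study's pipeline: the logistic scoring function `toy_risk`, its coefficients, the feature names, and the baseline and participant values are all invented for illustration, and the study itself applied SHAP to a random forest rather than enumerating coalitions by hand.

```python
from itertools import combinations
from math import exp, factorial

# Hypothetical feature names, loosely echoing the questionnaire items.
FEATURES = ["age_at_menarche", "education_level", "parity", "bmi"]


def toy_risk(x):
    """Invented logistic score standing in for a trained model (assumption)."""
    z = (
        -0.10 * (x["age_at_menarche"] - 13)
        + 0.10 * x["education_level"]
        - 0.20 * x["parity"]
        + 0.05 * (x["bmi"] - 24)
    )
    return 1.0 / (1.0 + exp(-z))


def shapley_values(model, x, baseline):
    """Exact Shapley values: features outside a coalition take baseline values."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_f = {
                    g: x[g] if (g in subset or g == f) else baseline[g]
                    for g in FEATURES
                }
                without_f = {
                    g: x[g] if g in subset else baseline[g] for g in FEATURES
                }
                total += weight * (model(with_f) - model(without_f))
        phi[f] = total
    return phi


# Hypothetical reference profile and one hypothetical participant.
baseline = {"age_at_menarche": 13, "education_level": 2, "parity": 1, "bmi": 24}
participant = {"age_at_menarche": 11, "education_level": 3, "parity": 3, "bmi": 29}

phi = shapley_values(toy_risk, participant, baseline)

# Individualized ranking: largest absolute contribution first.
ranking = sorted(phi, key=lambda f: abs(phi[f]), reverse=True)
for name in ranking:
    print(f"{name}: {phi[name]:+.4f}")
```

The contributions sum to the difference between the participant's score and the baseline score, which is the additivity property that lets SHAP values be read as a per-person breakdown. Exhaustive enumeration is only practical for a handful of features; tree-based explainers avoid it for models such as random forests.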