Interpretable Machine Learning for Serum-Based Metabolomics in Breast Cancer Diagnostics: Insights from Multi-Objective Feature Selection-Driven LightGBM-SHAP Models.
Author information
Guldogan Emek, Yagin Fatma Hilal, Ucuzal Hasan, Alzakari Sarah A, Alhussan Amel Ali, Ardigò Luca Paolo
Affiliations
Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280 Malatya, Turkey.
Department of Biostatistics, Faculty of Medicine, Malatya Turgut Ozal University, 44210 Malatya, Turkey.
Publication information
Medicina (Kaunas). 2025 Jun 19;61(6):1112. doi: 10.3390/medicina61061112.
Breast cancer accounts for 12.5% of all new cancer cases in women worldwide. Early detection significantly improves survival rates, but traditional biomarkers such as CA 15-3 and HER2 lack sensitivity and specificity, particularly for early-stage disease. Advances in metabolomics and machine learning, particularly explainable artificial intelligence (XAI), offer new opportunities for identifying robust biomarkers and improving diagnostic accuracy.
This study aimed to identify and validate serum-based metabolic biomarkers for breast cancer using advanced metabolomic profiling techniques and a Light Gradient Boosting Machine (LightGBM) model. Additionally, SHapley Additive exPlanations (SHAP) were applied to enhance model interpretability and biological insight.
The study included 103 breast cancer patients and 31 healthy controls. Serum samples were analyzed by liquid and gas chromatography-time-of-flight mass spectrometry (LC-TOFMS and GC-TOFMS). Mutual Information (MI), Sparse Partial Least Squares (sPLS), Boruta, and Multi-Objective Feature Selection (MOFS) approaches were applied to the data for biomarker discovery. LightGBM, AdaBoost, and Random Forest were employed for classification, with class imbalance addressed using the Synthetic Minority Oversampling Technique (SMOTE). SHAP analysis ranked metabolites based on their contribution to model predictions.
Compared to the other feature selection approaches, MOFS was more robust in terms of predictive performance, and the metabolites it identified were used in subsequent analyses for biomarker discovery. LightGBM outperformed the AdaBoost and Random Forest models, achieving 86.6% accuracy, 89.1% sensitivity, 84.2% specificity, and an F1-score of 87.0%. SHAP analysis identified 2-aminobutyric acid, choline, and coproporphyrin as the most influential metabolites, with dysregulation of these markers associated with breast cancer risk.
This study is among the first to integrate SHAP explainability with metabolomic profiling, bridging computational predictions and biological insights for improved clinical adoption. This study demonstrates the effectiveness of combining metabolomics with XAI-driven machine learning for breast cancer diagnostics. The identified biomarkers not only improve diagnostic accuracy but also reveal critical metabolic dysregulations associated with disease progression.
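For readers less familiar with the reported performance measures, the following minimal sketch shows how accuracy, sensitivity, specificity, and F1-score are derived from a binary confusion matrix. The counts are hypothetical illustration values, not data from the study, and the names `ConfusionCounts` and `diagnostic_metrics` are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary classifier; 'positive' means breast cancer."""
    tp: int  # patients correctly flagged as cancer
    fn: int  # patients missed by the model
    tn: int  # controls correctly flagged as healthy
    fp: int  # controls wrongly flagged as cancer


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return the four measures quoted in the abstract as fractions in [0, 1]."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),   # recall on the cancer class
        "specificity": c.tn / (c.tn + c.fp),   # recall on the control class
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = diagnostic_metrics(ConfusionCounts(tp=8, fn=2, tn=7, fp=3))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity are computed per class, they remain informative when the classes are imbalanced, which is one reason they are reported alongside accuracy.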
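The idea behind SHAP-based ranking can be illustrated with an exact Shapley value computation on a toy model. This is a sketch under strong simplifying assumptions: it uses a single baseline vector in place of a background dataset, enumerates all feature subsets (feasible only for a handful of features), and does not reproduce the tree-specific algorithm that the SHAP library applies to LightGBM. The function `shapley_values`, the model `toy_score`, and the three unnamed features are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attributions of predict(x) relative to predict(baseline).

    Features absent from a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def toy_score(z):
    """Hypothetical risk score: one additive term plus one interaction."""
    return 2.0 * z[0] + z[1] * z[2]


sample = [1.0, 1.0, 1.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, sample, reference)

# Rank features by absolute contribution, as a SHAP summary would.
ranking = sorted(range(len(phi)), key=lambda i: abs(phi[i]), reverse=True)
print(phi, ranking)
```

The attributions sum to the difference between the prediction for the sample and the prediction for the baseline, and the interaction term is split equally between the two features that share it.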