Interpretable Machine Learning for Serum-Based Metabolomics in Breast Cancer Diagnostics: Insights from Multi-Objective Feature Selection-Driven LightGBM-SHAP Models.

Author information

Guldogan Emek, Yagin Fatma Hilal, Ucuzal Hasan, Alzakari Sarah A, Alhussan Amel Ali, Ardigò Luca Paolo

Affiliations

Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280 Malatya, Turkey.

Department of Biostatistics, Faculty of Medicine, Malatya Turgut Ozal University, 44210 Malatya, Turkey.

Publication information

Medicina (Kaunas). 2025 Jun 19;61(6):1112. doi: 10.3390/medicina61061112.

DOI:10.3390/medicina61061112

PMID:40572800

Abstract

Breast cancer accounts for 12.5% of all new cancer cases in women worldwide. Early detection significantly improves survival rates, but traditional biomarkers such as CA 15-3 and HER2 lack sensitivity and specificity, particularly for early-stage disease. Advances in metabolomics and machine learning, particularly explainable artificial intelligence (XAI), offer new opportunities for identifying robust biomarkers and improving diagnostic accuracy. This study aimed to identify and validate serum-based metabolic biomarkers for breast cancer using advanced metabolomic profiling techniques and a Light Gradient Boosting Machine (LightGBM) model. Additionally, SHapley Additive exPlanations (SHAP) were applied to enhance model interpretability and biological insight. The study included 103 breast cancer patients and 31 healthy controls. Serum samples were analyzed by liquid and gas chromatography-time-of-flight mass spectrometry (LC-TOFMS and GC-TOFMS). Mutual Information (MI), Sparse Partial Least Squares (sPLS), Boruta, and Multi-Objective Feature Selection (MOFS) approaches were applied to the data for biomarker discovery. LightGBM, AdaBoost, and Random Forest were employed for classification, and class imbalance was addressed with the Synthetic Minority Oversampling Technique (SMOTE). SHAP analysis ranked metabolites by their contribution to model predictions. The MOFS approach yielded more robust predictive performance than the other feature selection approaches, so the metabolites it identified were used in the subsequent biomarker discovery analyses. LightGBM outperformed the AdaBoost and Random Forest models, achieving 86.6% accuracy, 89.1% sensitivity, 84.2% specificity, and an F1-score of 87.0%. SHAP analysis identified 2-aminobutyric acid, choline, and coproporphyrin as the most influential metabolites, with dysregulation of these markers associated with breast cancer risk. This study is among the first to integrate SHAP explainability with metabolomic profiling, bridging computational predictions and biological insights for improved clinical adoption. The results demonstrate the effectiveness of combining metabolomics with XAI-driven machine learning for breast cancer diagnostics. The identified biomarkers not only improve diagnostic accuracy but also reveal critical metabolic dysregulations associated with disease progression.
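The four performance figures quoted in the abstract (accuracy, sensitivity, specificity, F1-score) are all derived from a binary confusion matrix. The sketch below shows those standard definitions, treating breast cancer as the positive class. The counts in the example are hypothetical and chosen only for easy arithmetic; they are not the study's data, and the names `ConfusionCounts` and `diagnostic_metrics` are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (positive class = breast cancer)."""
    tp: int  # patients correctly classified as patients
    fn: int  # patients misclassified as controls
    tn: int  # controls correctly classified as controls
    fp: int  # controls misclassified as patients


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the standard metrics; assumes every denominator is non-zero."""
    sensitivity = c.tp / (c.tp + c.fn)   # recall on the positive class
    specificity = c.tn / (c.tn + c.fp)   # recall on the negative class
    precision = c.tp / (c.tp + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only (NOT from the study).
example = ConfusionCounts(tp=90, fn=10, tn=80, fp=20)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity are computed per class, they remain informative when the classes are imbalanced, whereas accuracy alone can be dominated by the larger class.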

Similar articles

Prediction of Insulin Resistance in Nondiabetic Population Using LightGBM and Cohort Validation of Its Clinical Value: Cross-Sectional and Retrospective Cohort Study.

JMIR Med Inform. 2025 Jun 13;13:e72238. doi: 10.2196/72238.

A Responsible Framework for Assessing, Selecting, and Explaining Machine Learning Models in Cardiovascular Disease Outcomes Among People With Type 2 Diabetes: Methodology and Validation Study.

JMIR Med Inform. 2025 Jun 27;13:e66200. doi: 10.2196/66200.

Stabilizing machine learning for reproducible and explainable results: A novel validation approach to subject-specific insights.

Comput Methods Programs Biomed. 2025 Jun 21;269:108899. doi: 10.1016/j.cmpb.2025.108899.

XGB-BIF: An XGBoost-Driven Biomarker Identification Framework for Detecting Cancer Using Human Genomic Data.

Int J Mol Sci. 2025 Jun 11;26(12):5590. doi: 10.3390/ijms26125590.

Serum calcium-based interpretable machine learning model for predicting anastomotic leakage after rectal cancer resection: A multi-center study.

World J Gastroenterol. 2025 May 21;31(19):105283. doi: 10.3748/wjg.v31.i19.105283.

Cost-effectiveness of using prognostic information to select women with breast cancer for adjuvant systemic therapy.

Health Technol Assess. 2006 Sep;10(34):iii-iv, ix-xi, 1-204. doi: 10.3310/hta10340.

Proposed Comprehensive Methodology Integrated with Explainable Artificial Intelligence for Prediction of Possible Biomarkers in Metabolomics Panel of Plasma Samples for Breast Cancer Detection.

Medicina (Kaunas). 2025 Mar 25;61(4):581. doi: 10.3390/medicina61040581.

Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.

Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3.

Extracellular vesicles as biomarkers for metabolic dysfunction-associated steatotic liver disease staging using explainable artificial intelligence.

World J Gastroenterol. 2025 Jun 14;31(22):106937. doi: 10.3748/wjg.v31.i22.106937.

References cited in this article

Integrating Molecular Perspectives: Strategies for Comprehensive Multi-Omics Integrative Data Analysis and Machine Learning Applications in Transcriptomics, Proteomics, and Metabolomics.

Biology (Basel). 2024 Oct 22;13(11):848. doi: 10.3390/biology13110848.

Identification of a Novel Biomarker Panel for Breast Cancer Screening.

Int J Mol Sci. 2024 Nov 4;25(21):11835. doi: 10.3390/ijms252111835.

Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development.

Clin Transl Sci. 2024 Nov;17(11):e70056. doi: 10.1111/cts.70056.

Combining metabolomics and machine learning to discover biomarkers for early-stage breast cancer diagnosis.

PLoS One. 2024 Oct 21;19(10):e0311810. doi: 10.1371/journal.pone.0311810. eCollection 2024.

Hippo pathway effectors YAP, TAZ and TEAD are associated with EMT master regulators ZEB, Snail and with aggressive phenotype in phyllodes breast tumors.

Pathol Res Pract. 2024 Oct;262:155551. doi: 10.1016/j.prp.2024.155551. Epub 2024 Aug 15.

Roles and Mechanisms of Choline Metabolism in Nonalcoholic Fatty Liver Disease and Cancers.

Front Biosci (Landmark Ed). 2024 May 11;29(5):182. doi: 10.31083/j.fbl2905182.

Identification of PTPN12 Phosphatase as a Novel Negative Regulator of Hippo Pathway Effectors YAP/TAZ in Breast Cancer.

Int J Mol Sci. 2024 Apr 5;25(7):4064. doi: 10.3390/ijms25074064.

LC-MS/MS platform-based serum untargeted screening reveals the diagnostic biomarker panel and molecular mechanism of breast cancer.

Methods. 2024 Feb;222:100-111. doi: 10.1016/j.ymeth.2024.01.003. Epub 2024 Jan 14.

Clinical Significance of Carnitine in the Treatment of Cancer: From Traffic to the Regulation.

Oxid Med Cell Longev. 2023 Aug 10;2023:9328344. doi: 10.1155/2023/9328344. eCollection 2023.

FoxO signaling and mitochondria-related apoptosis pathways mediate tsinling lenok trout (Brachymystax lenok tsinlingensis) liver injury under high temperature stress.

Int J Biol Macromol. 2023 Nov 1;251:126404. doi: 10.1016/j.ijbiomac.2023.126404. Epub 2023 Aug 18.
