Radhakrishnan Swarnima Kollampallath, Nath Dipanwita, Russ Dominic, Merodio Laura Bravo, Lad Priyani, Daisi Folakemi Kola, Acharjee Animesh
College of Medicine and Health, School of Medical Sciences, Cancer and Genomic Sciences, University of Birmingham, Birmingham, United Kingdom.
Institute of Translational Medicine, University Hospitals Birmingham National Health Service (NHS) Foundation Trust, Birmingham, United Kingdom.
Front Oncol. 2025 Jan 7;14:1505675. doi: 10.3389/fonc.2024.1505675. eCollection 2024.
Colorectal cancer is one of the leading causes of cancer-related mortality worldwide, and its incidence and mortality are predicted to rise globally over the next several decades. When detected early, colorectal cancer is treatable with surgery and medication, which creates a need for prognostic and diagnostic biomarkers. Our study integrates machine learning models and protein network analysis to identify protein biomarkers for colorectal cancer. Our methodology leverages an extensive collection of proteome profiles from both healthy individuals and individuals with colorectal cancer. To identify potential biomarkers with high predictive ability, we used three machine learning models. To enhance the interpretability of our models, we quantified each protein's contribution to the model's predictions using SHapley Additive exPlanations (SHAP) values. Three classifiers (LASSO, XGBoost, and LightGBM) were evaluated for predictive performance, with the hyperparameters of each model tuned by grid search. LASSO achieved the highest AUC, 75%, in the UK Biobank dataset, while LightGBM and XGBoost achieved AUCs of 69.61% and 71.42%, respectively. Using SHAP values, TFF3, LCN2, and CEACAM5 were identified as key biomarkers associated with cell adhesion and inflammation. Protein quantitative trait loci (pQTL) analyses provided further evidence for the involvement of TFF1, CEACAM5, and SELE in colorectal cancer, with possible connections to the PI3K/Akt and MAPK signaling pathways. By offering insights into colorectal cancer diagnostics and targeted therapeutics, our findings set the stage for further biomarker validation.
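The abstract reports classifier quality as AUC and protein importance as SHapley Additive exPlanations values. As a rough illustration of what these two quantities measure, the sketch below computes both for a toy linear risk score using only the Python standard library. It is not the authors' pipeline: the protein names, weights, baseline, and sample values are made-up placeholders, and the exact enumeration over feature subsets used here is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        return predict([x[i] if i in coalition else baseline[i]
                        for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy model: a linear score over three placeholder proteins.
PROTEINS = ["protein_a", "protein_b", "protein_c"]  # placeholder names
WEIGHTS = [0.8, 0.5, -0.3]                          # made-up coefficients


def predict(values):
    return sum(w * v for w, v in zip(WEIGHTS, values))


# Made-up abundance values; label 1 = case, 0 = control.
samples = [
    [1.0, 2.0, 1.0], [0.5, 1.0, 0.0], [0.2, 0.1, 1.0],
    [0.3, 0.2, 0.5], [0.9, 0.2, 0.0], [0.0, 0.1, 0.2],
]
labels = [1, 1, 1, 0, 0, 0]

scores = [predict(s) for s in samples]
auc = roc_auc(labels, scores)

baseline = [0.0, 0.0, 0.0]  # assumed reference profile
phi = shapley_values(predict, samples[0], baseline)
attribution = dict(zip(PROTEINS, phi))
```

For a linear score such as this one, each attribution reduces to the weight times the difference between the sample value and the baseline, and the attributions sum to the difference between the sample's score and the baseline score. Tree ensembles such as XGBoost and LightGBM do not have this closed form, which is why dedicated SHAP estimators are used for them in practice.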