XGB-BIF: An XGBoost-Driven Biomarker Identification Framework for Detecting Cancer Using Human Genomic Data.

Author information

Ghuriani Veena, Wassan Jyotsna Talreja, Tripathi Priyal, Chauhan Anshika

Affiliation

Maitreyi College, University of Delhi, New Delhi 110021, India.

Publication information

Int J Mol Sci. 2025 Jun 11;26(12):5590. doi: 10.3390/ijms26125590.

Abstract

The human genome has a profound impact on human health and disease detection. Carcinoma (cancer) is among the diseases that most severely affect human health, and effective detection is the basis for developing treatment strategies and targeted therapies. Our research therefore aims to identify biomarkers associated with three cancer types (gastric, lung, and breast) using machine learning. In this study, we analyzed human genomic data from gastric, breast, and lung cancer patients using XGB-BIF (XGBoost-Driven Biomarker Identification Framework for detecting cancer). The framework performs feature selection with XGBoost (eXtreme Gradient Boosting), which captures feature interactions efficiently and accounts for non-linear effects in genomic data. XGBoost was first trained on the full dataset and the features were ranked by the Gain importance measure. In the subsequent classification phase, support vector machine (SVM), logistic regression (LR), and random forest (RF) models classified samples into cancer-diseased and non-diseased states. To ensure interpretability and transparency, we also applied SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), which enabled the identification of high-impact biomarkers contributing to risk stratification. To strengthen translational value, the significance of the identified biomarkers is discussed primarily through pathway enrichment and survival analysis (Kaplan-Meier curves, Cox regression). Our models achieved high predictive performance, with an accuracy above 90% in classifying genomic data into diseased (cancer) and non-diseased states. We also evaluated the models with Cohen's Kappa statistic, which confirmed strong agreement between predicted and actual risk categories, with Kappa scores ranging from 0.80 to 0.99.
In external validation on the METABRIC dataset, the framework also performed well, attaining an AUC-ROC of 93%, an accuracy of 79%, and a Kappa of 74%. Through extensive experimentation, XGB-BIF identified the top biomarker genes for each cancer dataset (gastric, lung, and breast): [gene names missing in source] were identified as important biomarkers for distinguishing diseased from non-diseased states of gastric cancer; [gene names missing in source] were identified as important biomarkers for breast cancer; and [gene names missing in source] were identified as important biomarkers for lung cancer. XGB-BIF could be used to identify biomarkers of different cancer types from genetic data, which may in turn help clinicians develop targeted therapies for cancer patients.
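
The abstract reports Cohen's Kappa alongside accuracy. As a minimal sketch of how the two relate, the snippet below computes Kappa from its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa` and the labels are hypothetical, made up for illustration; this is not the authors' code or data.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of samples where the prediction matches the label.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of the product of the marginal frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case (both sequences contain one identical class);
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels (1 = cancer, 0 = non-diseased), invented for illustration.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}")
```

In this toy case one of ten samples is misclassified, so accuracy is 0.90 while Kappa is 0.80, because half of the agreement would be expected by chance alone. Kappa therefore sits below accuracy whenever chance agreement is non-zero.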

