CCPred: Global and population-specific colorectal cancer prediction and metagenomic biomarker identification at different molecular levels using machine learning techniques.

Affiliations

Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, 38080, Turkey.

Department of Electrical and Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, 38080, Turkey.

Publication information

Comput Biol Med. 2024 Nov;182:109098. doi: 10.1016/j.compbiomed.2024.109098. Epub 2024 Sep 17.

DOI:10.1016/j.compbiomed.2024.109098

PMID:39293338

Abstract

Colorectal cancer (CRC) ranks as the third most common cancer globally and the second leading cause of cancer-related deaths. Recent research highlights the pivotal role of the gut microbiota in CRC development and progression. Understanding the complex interplay between disease development and metagenomic data is essential for CRC diagnosis and treatment. Current computational models employ machine learning to identify metagenomic biomarkers associated with CRC, yet their accuracy still needs to be improved by incorporating a holistic view of the underlying biology. This study evaluates CRC-associated metagenomic data at the species, enzyme, and pathway levels through global and population-specific analyses. These analyses use relative abundance values from human gut microbiome sequencing data to build robust classification models for disease prediction and biomarker identification. For global CRC prediction and biomarker identification, the features identified by the SelectKBest (SKB), Information Gain (IG), and Extreme Gradient Boosting (XGBoost) methods are combined. The population-based analysis includes within-population, leave-one-dataset-out (LODO), and cross-population approaches. Four classification algorithms are employed for CRC classification. Globally, Random Forest achieved an AUC of 0.83 for species data, 0.78 for enzyme data, and 0.76 for pathway data. On the global scale, potential taxonomic biomarkers include Ruthenibacterium lactatiformans; enzyme biomarkers include RNA 2',3'-cyclic 3'-phosphodiesterase; and pathway biomarkers include the pyruvate fermentation to acetone pathway. This study underscores the potential of machine learning models trained on metagenomic data for improved disease prediction and biomarker discovery. The proposed model and associated files are available at https://github.com/TemizMus/CCPRED.
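
The evaluation scheme named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation (that is in the repository linked above), and every function name here is hypothetical. It shows a leave-one-dataset-out (LODO) split, one possible way of combining feature lists from several selectors, and a rank-based AUC. The abstract does not say how the SKB, IG, and XGBoost selections are combined, so the union used below is an assumption.

```python
from collections import defaultdict


def lodo_splits(cohorts):
    """Yield (held_out, train_idx, test_idx) for leave-one-dataset-out evaluation.

    `cohorts` holds one dataset/cohort label per sample. Each cohort is held
    out once as the test set while all remaining cohorts form the training set.
    """
    by_cohort = defaultdict(list)
    for i, cohort in enumerate(cohorts):
        by_cohort[cohort].append(i)
    for held_out in sorted(by_cohort):
        test_idx = by_cohort[held_out]
        train_idx = sorted(
            i for cohort, idx in by_cohort.items() if cohort != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


def combine_features(*selections):
    """Combine feature names chosen by several selectors.

    Assumption: a plain union; the source only says the features are "combined".
    """
    combined = set()
    for selection in selections:
        combined.update(selection)
    return sorted(combined)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as 0.5. Labels are 1 (case) or 0 (control).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a full pipeline, each LODO split would train a classifier such as Random Forest on `train_idx`, restricted to the combined features, and score the held-out cohort with `roc_auc`. The pairwise AUC above is quadratic in the number of samples, which is acceptable for a sketch but not for large cohorts.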
