Department of Economics, Statistics and Business, Faculty of Economics and Law, Universitas Mercatorum, Rome, Italy.
Department of Biology, University of Tirana, Tirana, Albania.
J Med Microbiol. 2024 Oct;73(10). doi: 10.1099/jmm.0.001903.
This study addresses the challenge of using human gut microbiome data for the early detection of colorectal cancer (CRC). Machine learning techniques can analyse complex microbiome datasets and so offer a non-invasive route to identifying CRC-related microbial markers. The primary hypothesis is that a robust machine learning-based analysis of 16S rRNA microbiome data can identify specific microbial features that serve as effective biomarkers for CRC detection, overcoming the limitations of classical statistical models in high-dimensional settings. The primary objective is to explore and validate the potential of the human microbiome, specifically in the colon, as a source of biomarkers for CRC detection and progression. The focus is on developing a classifier that distinguishes CRC from normal samples using three previously published faecal 16S rRNA sequencing datasets. To this end, several machine learning techniques are applied, including random forest (RF), recursive feature elimination (RFE) and a robust correlation-based technique known as the fuzzy forest (FF). Their performance in classifying CRC and normal samples is compared across the three datasets. The most relevant microbial features (taxa) associated with CRC development are identified with partial dependence plots, an explainability tool that visualizes how a feature influences the predicted outcome. Across the three faecal 16S rRNA sequencing datasets, FF shows consistently superior predictive performance compared with RF and RFE. Notably, FF is effective in addressing the correlation problem when assessing the importance of microbial taxa in explaining the development of CRC.
The results highlight the potential of the human microbiome as a non-invasive means of detecting CRC and the value of FF for improving predictive accuracy. In conclusion, classical statistical techniques are limited in handling high-dimensional information such as human microbiome data. The human microbiome, specifically in the colon, is a valuable potential source of biomarkers for CRC detection, and machine learning techniques, particularly FF, are a promising approach for building a classifier that distinguishes CRC from normal samples. The findings support the use of FF to overcome correlation-related challenges when identifying microbial features linked to CRC development.