Papoutsoglou Georgios, Tarazona Sonia, Lopes Marta B, Klammsteiner Thomas, Ibrahimi Eliana, Eckenberger Julia, Novielli Pierfrancesco, Tonda Alberto, Simeon Andrea, Shigdel Rajesh, Béreux Stéphane, Vitali Giacomo, Tangaro Sabina, Lahti Leo, Temko Andriy, Claesson Marcus J, Berland Magali

Department of Computer Science, University of Crete, Heraklion, Greece.

JADBio Gnosis DA S.A., Science and Technology Park of Crete, Heraklion, Greece.

Front Microbiol. 2023 Sep 22;14:1261889. doi: 10.3389/fmicb.2023.1261889. eCollection 2023.

Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation, and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrate that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm in combination with random forest modeling provided the most accurate performance estimates. Lastly, we show how linear modeling by logistic regression, coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots, can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
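As a rough illustration of two ideas named in the abstract, the sketch below applies a centered log-ratio (CLR) transform, one common compositional transformation, and then computes ICE curves for a logistic model. It is not the authors' pipeline: the pseudocount of 0.5, the toy count table, and the fixed coefficients in `logistic_predict` are all assumptions chosen only to keep the example self-contained.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumed value) avoids log(0) for absent taxa.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def ice_curves(predict, samples, feature, grid):
    """Return one curve per sample: the prediction as `feature` sweeps
    over `grid` while that sample's other features stay fixed."""
    curves = []
    for row in samples:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves

def logistic_predict(x, weights=(1.2, -0.8, 0.0), bias=0.1):
    """Hypothetical fitted logistic regression; coefficients are made up."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Toy count table: 3 samples x 3 taxa (illustrative values only).
counts = [[120, 30, 0], [15, 200, 45], [60, 60, 60]]
samples = [clr(row) for row in counts]

# Sweep the first (CLR-transformed) taxon over a small grid.
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
curves = ice_curves(logistic_predict, samples, feature=0, grid=grid)

for i, curve in enumerate(curves):
    print(f"sample {i}:", [round(p, 3) for p in curve])
```

Each curve traces how one sample's predicted probability would change if only the selected taxon varied. For a logistic model without interaction terms the curves all rise or fall together, which is part of what makes this model class easy to read.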
