Identifying Optimal Machine Learning Approaches for Microbiome-Metabolomics Integration with Stable Feature Selection.

Author information

Palmer Suzette N, Mishra Animesh, Gan Shuheng, Liu Dajiang, Koh Andrew Y, Zhan Xiaowei

Affiliations

Division of Hematology/Oncology, Department of Pediatrics, The University of Texas Southwestern Medical Center, Dallas, TX 75390, USA.

Department of Biomedical Engineering, The University of Texas Southwestern Medical Center, Dallas, TX 75390, USA.

Publication information

bioRxiv. 2025 Jun 30:2025.06.21.660858. doi: 10.1101/2025.06.21.660858.

Abstract

Microbiome research has been limited by methodological inconsistencies. Taxonomy-based profiling presents challenges such as data sparsity, variable taxonomic resolution, and the reliance on DNA-based profiling, which provides limited functional insight. Multi-omics integration has emerged as a promising approach to link microbiome composition with function. However, the lack of standardized methodologies and inconsistencies in machine learning strategies have hindered reproducibility. Additionally, while machine learning can be used to identify key microbial and metabolic features, the stability of feature selection across models and data types remains underexplored, despite its importance for downstream experimental validation and biomarker discovery. Here, we systematically compare Elastic Net, Random Forest, and XGBoost across five multi-omics integration strategies: Concatenation, Averaged Stacking, Weighted Non-negative Least Squares (NNLS), Lasso Stacking, and Partial Least Squares (PLS), as well as individual 'omics models. We evaluate performance across 588 binary and 735 continuous models using microbiome-derived metabolomics and taxonomic data. Additionally, we assess the impact of feature reduction on model performance and feature selection stability. Among the approaches tested, Random Forest combined with NNLS yielded the highest overall performance across diverse datasets. Tree-based methods also demonstrated consistent feature selection across data types and dimensionalities. These results demonstrate how integration strategies, algorithm selection, data dimensionality, and response type impact both predictive performance and the stability of selected features in multi-omics microbiome modeling.
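To make the NNLS integration strategy concrete, the following is a minimal sketch of the general idea rather than the authors' pipeline. It assumes two base models (one per 'omics layer) have already produced held-out predictions, and it fits non-negative combination weights by least squares. The function names `nnls_two` and `stack`, the two-layer restriction, and the toy numbers are all illustrative assumptions; a real analysis would more likely call a library solver such as `scipy.optimize.nnls` on out-of-fold predictions.

```python
def nnls_two(p1, p2, y):
    """Non-negative least squares for exactly two predictors.

    Returns (w1, w2) with w1, w2 >= 0 minimising sum((w1*p1 + w2*p2 - y)^2).
    With two variables the optimum lies on one of four faces of the
    feasible region, so we evaluate each feasible candidate and keep the best.
    """
    a11 = sum(a * a for a in p1)
    a22 = sum(b * b for b in p2)
    a12 = sum(a * b for a, b in zip(p1, p2))
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))

    def sse(w1, w2):
        return sum((w1 * a + w2 * b - t) ** 2 for a, b, t in zip(p1, p2, y))

    candidates = [(0.0, 0.0)]                       # both weights at zero
    if a11 > 0:
        candidates.append((max(b1 / a11, 0.0), 0.0))  # only layer 1 active
    if a22 > 0:
        candidates.append((0.0, max(b2 / a22, 0.0)))  # only layer 2 active
    det = a11 * a22 - a12 * a12
    if det > 1e-12:                                 # both layers active
        w1 = (b1 * a22 - b2 * a12) / det
        w2 = (b2 * a11 - b1 * a12) / det
        if w1 >= 0 and w2 >= 0:
            candidates.append((w1, w2))
    return min(candidates, key=lambda w: sse(*w))


def stack(p1, p2, weights):
    """Combine two prediction vectors with the fitted weights."""
    w1, w2 = weights
    return [w1 * a + w2 * b for a, b in zip(p1, p2)]


# Hypothetical held-out predictions from a taxonomy model and a metabolomics
# model (made-up numbers, not data from the study).
y_true = [0, 1, 1, 0, 1]
pred_taxa = [0.2, 0.6, 0.7, 0.4, 0.5]
pred_metab = [0.1, 0.9, 0.8, 0.2, 0.7]

weights = nnls_two(pred_taxa, pred_metab, y_true)
stacked = stack(pred_taxa, pred_metab, weights)
print("weights:", weights)
print("stacked predictions:", stacked)
```

The non-negativity constraint keeps each layer's contribution interpretable: a layer whose predictions do not help receives a weight of zero rather than a negative weight that would offset another layer.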

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c47e/12236860/91a7bf9bf3c8/nihpp-2025.06.21.660858v2-f0001.jpg