Garach Vélez Ignacio, Ortuño Guzmán Francisco Manuel, Rojas Ruiz Ignacio, Herrera Maldonado Luis Javier
Department of Computer Engineering, Automation and Robotics (ICAR), University of Granada, 18071 Granada, Spain.
Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf096.
Disease classification using 16S rRNA microbiome data faces challenges of high dimensionality, compositionality, and sparsity, compounded by the small sample sizes typical of many studies. Machine learning and feature selection techniques can help identify robust biomarkers and improve classification performance; however, their comparative effectiveness across diverse methods and datasets has been insufficiently explored. This study evaluates multiple feature selection techniques alongside normalization strategies, focusing on their interplay with classifier performance.
Our analyses revealed that centered log-ratio normalization improves the performance of logistic regression and support vector machine models and facilitates feature selection, whereas random forest models yield strong results using relative abundances. Interestingly, presence-absence normalization achieved performance similar to that of abundance-based transformations across classifiers. Among feature selection methods, minimum redundancy maximum relevancy (mRMR) surpassed most methods in identifying compact feature sets and demonstrated performance comparable to least absolute shrinkage and selection operator (LASSO), which obtained top results while requiring less computation time. Autoencoders needed larger latent spaces to perform well and lacked interpretability, mutual information suffered from redundancy, and ReliefF struggled with data sparsity.
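To make the three normalization strategies named above concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' pipeline: the function names, the toy count matrix, and the pseudocount of 0.5 used to handle zeros before the log are all assumptions made for illustration, since the abstract does not say how zeros were treated.

```python
import numpy as np

def relative_abundance(counts):
    """Scale each sample (row) so its taxa sum to 1.

    Assumes every sample has at least one non-zero count.
    """
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each value minus the row mean of the logs.

    The pseudocount is an illustrative assumption to avoid log(0).
    """
    log_x = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return log_x - log_x.mean(axis=1, keepdims=True)

def presence_absence(counts):
    """Binarize: 1 if the taxon was observed in the sample, else 0."""
    return (np.asarray(counts) > 0).astype(int)

# Hypothetical toy table: 2 samples (rows) x 3 taxa (columns).
counts = np.array([[10, 0, 90],
                   [5, 5, 0]])

print(relative_abundance(counts))
print(clr(counts))
print(presence_absence(counts))
```

By construction, each row of the centered log-ratio output sums to zero, which is what removes the dependence on sequencing depth; the presence-absence version discards abundance entirely and keeps only which taxa were detected.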
Overall, feature selection pipelines improved model focus and robustness via a massive reduction of the feature space. mRMR and LASSO emerged as the most effective methods across datasets.
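The contrast between mRMR and plain mutual information ranking can be illustrated with a small greedy sketch. This is an assumption-laden toy, not the implementation evaluated in the study: it uses the difference form of the criterion (relevance minus mean redundancy), a plug-in mutual information estimate for discrete features such as presence-absence data, and hypothetical names and data throughout.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in mutual information (in nats) between two discrete 1-D arrays."""
    a, b = np.asarray(a), np.asarray(b)
    mi = 0.0
    for va in np.unique(a):
        pa = np.mean(a == va)
        for vb in np.unique(b):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def mrmr(X, y, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    X = np.asarray(X)
    n_features = X.shape[1]
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(n_features)]
    )
    # Start from the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, n_features):
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Average overlap with the features already chosen.
            redundancy = np.mean(
                [mutual_information(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Hypothetical toy data: feature 1 duplicates feature 0; feature 2 is weaker
# but carries different information about the label.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
f0 = np.array([0, 0, 0, 1, 1, 1, 1, 1])
f1 = f0.copy()
f2 = np.array([0, 0, 1, 0, 1, 1, 1, 0])
X = np.column_stack([f0, f1, f2])

print(mrmr(X, y, 2))
```

On this toy matrix, ranking by mutual information alone would place the duplicated features 0 and 1 at the top, whereas the greedy criterion skips the duplicate and selects features 0 and 2. That is the redundancy problem the abstract attributes to mutual information, shown at the smallest possible scale.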