Zschaubitz Erik, Schröder Henning, Glackin Conor Christopher, Vogel Lukas, Labrenz Matthias, Sperlea Theodor
Department of Biological Oceanography, Leibniz Institute for Baltic Sea Research, Seestraße 15, Rostock, 18119, Germany.
Planet AI GmbH, Warnowufer 60, Rostock, 18057, Germany.
Comput Struct Biotechnol J. 2025 Apr 16;27:1636-1647. doi: 10.1016/j.csbj.2025.04.017. eCollection 2025.
Next-generation sequencing methods such as DNA metabarcoding enable the generation of large community composition datasets and have become instrumental in many branches of ecology in recent years. However, the sparsity, compositionality, and high dimensionality of metabarcoding datasets pose challenges in data analysis. In theory, feature selection methods improve the analyzability of environmental DNA (eDNA) metabarcoding datasets by identifying a subset of informative taxa that are relevant for a certain task and discarding those that are redundant or irrelevant. However, general guidelines for choosing a feature selection method for a given setting are lacking. Here, we report a comparison of feature selection methods in a supervised machine learning setup across 13 environmental metabarcoding datasets with differing characteristics. We evaluate workflows that consist of data preprocessing, feature selection, and a machine learning model by their ability to capture the ecological relationship between microbial community composition and environmental parameters. Our results demonstrate that, while the optimal feature selection approach depends on dataset characteristics, feature selection is more likely to impair model performance than to improve it for tree ensemble models like Random Forests. Furthermore, our results show that calculating relative counts impairs model performance, which suggests that novel methods to combat the compositionality of metabarcoding data are required.
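The workflow structure described in the abstract (data preprocessing, feature selection, then a machine learning model) can be illustrated with a minimal sketch. This is not the authors' code or benchmark. It assumes scikit-learn, uses `SelectKBest` with `f_regression` as a stand-in feature selection method, and runs on synthetic count data generated by the hypothetical helper `make_synthetic_counts`. The function names, the parameter `k`, and all data are assumptions made for illustration, and the printed scores say nothing about the paper's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_relative_counts(counts):
    """Divide each sample (row) by its total read count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave empty samples as all zeros
    return counts / totals


def make_synthetic_counts(n_samples=60, n_taxa=200, seed=0):
    """Hypothetical sparse count table; only the first 5 taxa track the target."""
    rng = np.random.default_rng(seed)
    env = rng.uniform(0.0, 1.0, size=n_samples)  # stand-in environmental parameter
    rates = rng.gamma(shape=0.3, scale=5.0, size=(n_samples, n_taxa))
    rates[:, :5] += 20.0 * env[:, None]
    return rng.poisson(rates), env


def build_workflows(k=20, seed=0):
    """Three example workflows: preprocessing -> feature selection -> model."""
    def forest():
        return RandomForestRegressor(n_estimators=100, random_state=seed)

    return {
        "raw counts, no selection": Pipeline([("model", forest())]),
        "raw counts, SelectKBest": Pipeline([
            ("select", SelectKBest(f_regression, k=k)),
            ("model", forest()),
        ]),
        "relative counts, no selection": Pipeline([
            ("relative", FunctionTransformer(to_relative_counts)),
            ("model", forest()),
        ]),
    }


def evaluate(workflows, X, y, cv=5):
    """Mean cross-validated R^2 per workflow; selection is refit inside each fold."""
    return {
        name: cross_val_score(wf, X, y, cv=cv, scoring="r2").mean()
        for name, wf in workflows.items()
    }


if __name__ == "__main__":
    X, y = make_synthetic_counts()
    for name, score in evaluate(build_workflows(), X, y).items():
        print(f"{name}: mean R^2 = {score:.3f}")
```

Wrapping each workflow in a `Pipeline` keeps the feature selection step inside the cross-validation loop, so taxa are selected from the training folds only and the held-out score is not inflated by information leakage.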