A benchmark analysis of feature selection and machine learning methods for environmental metabarcoding datasets.

Author information

Zschaubitz Erik, Schröder Henning, Glackin Conor Christopher, Vogel Lukas, Labrenz Matthias, Sperlea Theodor

Affiliations

Department of Biological Oceanography, Leibniz Institute for Baltic Sea Research, Seestraße 15, Rostock, 18119, Germany.

Planet AI GmbH, Warnowufer 60, Rostock, 18057, Germany.

Publication information

Comput Struct Biotechnol J. 2025 Apr 16;27:1636-1647. doi: 10.1016/j.csbj.2025.04.017. eCollection 2025.

DOI:10.1016/j.csbj.2025.04.017

PMID:40322584

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12049816/

Abstract

Next-Generation Sequencing methods like DNA metabarcoding enable the generation of large community composition datasets and have grown instrumental in many branches of ecology in recent years. However, the sparsity, compositionality, and high dimensionality of metabarcoding datasets pose challenges in data analysis. In theory, feature selection methods improve the analyzability of eDNA metabarcoding datasets by identifying a subset of informative taxa that are relevant for a certain task and discarding those that are redundant or irrelevant. However, general guidelines on selecting a feature selection method for application to a given setting are lacking. Here, we report a comparison of feature selection methods in a supervised machine learning setup across 13 environmental metabarcoding datasets with differing characteristics. We evaluate workflows that consist of data preprocessing, feature selection and a machine learning model by their ability to capture the ecological relationship between the microbial community composition and environmental parameters. Our results demonstrate that, while the optimal feature selection approach depends on dataset characteristics, feature selection is more likely to impair model performance than to improve it for tree ensemble models like Random Forests. Furthermore, our results show that calculating relative counts impairs model performance, which suggests that novel methods to combat the compositionality of metabarcoding data are required.
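
The compositionality issue mentioned in the abstract can be illustrated with a small sketch. This is not the authors' workflow; the function name `relative_counts` and the read counts are hypothetical, chosen only to show what converting raw counts to relative counts does to a sample.

```python
def relative_counts(counts):
    """Convert one sample's raw read counts into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        # An empty sample has no defined proportions; return zeros by convention.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Hypothetical read counts for three taxa in two samples.
# Only the first taxon differs between the samples.
sample_a = [100, 50, 50]
sample_b = [400, 50, 50]

rel_a = relative_counts(sample_a)
rel_b = relative_counts(sample_b)

for name, rel in (("sample_a", rel_a), ("sample_b", rel_b)):
    print(name, [round(r, 3) for r in rel])
```

In this toy case the raw counts of the second and third taxa are identical in both samples, yet their relative counts fall from 0.25 to 0.1 because the first taxon grew. A model trained on relative counts sees a change in taxa that did not change in the raw data, which is one way the transformation can distort the signal.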
