Belmann Peter, Osterholz Benedikt, Kleinbölting Nils, Pühler Alfred, Schlüter Andreas, Sczyrba Alexander
IBG-5: Computational Metagenomics, Institute of Bio- and Geosciences (IBG), Research Center Jülich GmbH, D-52428 Jülich, Germany.
Computational Metagenomics Group, Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Universitätsstrasse 25, D-33615 Bielefeld, Germany.
NAR Genom Bioinform. 2025 Jul 17;7(3):lqaf093. doi: 10.1093/nargab/lqaf093. eCollection 2025 Sep.
Metagenome analysis of complex environments spanning thousands of datasets, such as those in the Sequence Read Archive, requires substantial computational resources to complete within a reasonable time frame. Efficient use of infrastructure is essential, and analyses must be fully reproducible with publicly available workflows to ensure transparency. Here, we introduce the Metagenomics-Toolkit, a scalable, data-agnostic workflow that automates the analysis of short and long metagenomic reads from Illumina and Oxford Nanopore Technologies devices, respectively. The Metagenomics-Toolkit provides standard features such as quality control, assembly, binning, and annotation, along with unique capabilities including plasmid identification, recovery of unassembled microbial community members, and discovery of microbial interdependencies through dereplication, co-occurrence, and genome-scale metabolic modeling. Additionally, the Metagenomics-Toolkit includes a machine learning-optimized assembly step that adjusts peak RAM usage to match actual requirements, reducing the need for high-memory hardware. It can be executed on user workstations and includes optimizations for efficient cloud-based cluster execution. We compare the Metagenomics-Toolkit with five widely used metagenomics workflows and demonstrate its capabilities on 757 sewage metagenome datasets to investigate a possible sewage core microbiome. The Metagenomics-Toolkit is open source and available at https://github.com/metagenomics/metagenomics-tk.