Clum Alicia, Huntemann Marcel, Bushnell Brian, Foster Brian, Foster Bryce, Roux Simon, Hajek Patrick P, Varghese Neha, Mukherjee Supratim, Reddy T B K, Daum Chris, Yoshinaga Yuko, O'Malley Ronan, Seshadri Rekha, Kyrpides Nikos C, Eloe-Fadrosh Emiley A, Chen I-Min A, Copeland Alex, Ivanova Natalia N
Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
mSystems. 2021 May 18;6(3):e00804-20. doi: 10.1128/mSystems.00804-20.
The DOE Joint Genome Institute (JGI) Metagenome Workflow performs metagenome data processing, including assembly; structural, functional, and taxonomic annotation; and binning of metagenomic data sets. The processed data sets are subsequently included in the Integrated Microbial Genomes and Microbiomes (IMG/M) comparative analysis system (I.-M. A. Chen, K. Chu, K. Palaniappan, A. Ratner, et al., Nucleic Acids Res, 49:D751-D763, 2021, https://doi.org/10.1093/nar/gkaa939) and provided for download via the JGI data portal (https://genome.jgi.doe.gov/portal/). This workflow scales to run on thousands of metagenome samples per year, which vary in microbial community complexity and sequencing depth. Here, we describe the tools, databases, and parameters used at each step of the workflow to help with the interpretation of metagenome data available in IMG and to enable researchers to apply this workflow to their own data. We use 20 publicly available sediment metagenomes to illustrate the computing requirements for the different steps and highlight the typical results of data processing. The workflow modules for read filtering and metagenome assembly are available as a workflow description language (WDL) file (https://code.jgi.doe.gov/BFoster/jgi_meta_wdl). The workflow modules for annotation and binning are provided as a service to the user community at https://img.jgi.doe.gov/submit and require filling out the project and associated metadata descriptions in the Genomes OnLine Database (GOLD) (S. Mukherjee, D. Stamatis, J. Bertsch, G. Ovchinnikova, et al., Nucleic Acids Res, 49:D723-D733, 2021, https://doi.org/10.1093/nar/gkaa983).
The DOE JGI Metagenome Workflow is designed for processing metagenomic data sets starting from Illumina fastq files. It performs data preprocessing, error correction, assembly, structural and functional annotation, and binning. The results of processing are provided in several standard formats, such as fasta and gff, and can be used for subsequent integration into the Integrated Microbial Genomes and Microbiomes (IMG/M) system, where they can be compared to a comprehensive set of publicly available metagenomes. As of 30 July 2020, 7,155 JGI metagenomes had been processed by the DOE JGI Metagenome Workflow. Here, we present a metagenome workflow developed at the JGI that generates rich data in standard formats and has been optimized for downstream analyses ranging from assessment of the functional and taxonomic composition of microbial communities to genome-resolved metagenomics and the identification and characterization of novel taxa. This workflow is currently being used to analyze thousands of metagenomic data sets in a consistent and standardized manner.
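Because the results are delivered in standard fasta and gff formats, they can be inspected with very little code. The following is a minimal Python sketch, not part of the JGI workflow itself, that tallies contig lengths from fasta text and counts annotated features by type from gff text. The function names and the tiny example records are hypothetical and made up for illustration; real workflow output files may carry additional header fields or attributes.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_gff_features(lines: Iterable[str]) -> Counter:
    """Count GFF records by feature type (column 3), skipping comments."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A GFF record has nine tab-separated columns.
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts


# Made-up records for illustration only; not real workflow output.
example_fasta = """>contig_1 example
ACGTACGT
ACGT
>contig_2
GGCC
""".splitlines()

example_gff = [
    "##gff-version 3",
    "contig_1\texample\tCDS\t1\t9\t.\t+\t0\tID=gene_1",
    "contig_1\texample\ttRNA\t10\t12\t.\t-\t.\tID=gene_2",
    "contig_2\texample\tCDS\t1\t3\t.\t+\t0\tID=gene_3",
]

contig_lengths = {name: len(seq) for name, seq in read_fasta(example_fasta)}
feature_counts = count_gff_features(example_gff)
print(contig_lengths)
print(dict(feature_counts))
```

To apply the same functions to downloaded files, pass an open file handle in place of the example lists, since both functions accept any iterable of lines.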