A user's guide to quantitative and comparative analysis of metagenomic datasets.

Author information

Luo Chengwei, Rodriguez-R Luis M, Konstantinidis Konstantinos T

Affiliations

Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, Georgia, USA; School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA; School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.

Publication information

Methods Enzymol. 2013;531:525-47. doi: 10.1016/B978-0-12-407863-5.00023-X.

DOI:10.1016/B978-0-12-407863-5.00023-X

PMID:24060135

Abstract

Metagenomics has revolutionized microbiological studies during the past decade and provided new insights into the diversity, dynamics, and metabolic potential of natural microbial communities. However, metagenomics is still a developing field, and standardized tools and approaches to handle and compare metagenomes have not yet been established. An important reason for this is the continuous change in the type of sequencing data available, for example, long versus short sequencing reads. Here, we provide a guide to bioinformatic pipelines, focusing primarily on those developed by our team, that accomplish the following tasks: (i) assemble a metagenomic dataset; (ii) determine the level of sequence coverage obtained and the amount of sequencing required to obtain complete coverage; (iii) identify the taxonomic affiliation of a metagenomic read or assembled contig; and (iv) determine differentially abundant genes, pathways, and species between different datasets. Most of these pipelines either do not depend on the type of sequences available or can be easily adjusted to fit different types of sequences, and they are freely available (for instance, through our lab Web site: http://www.enve-omics.gatech.edu/). The limitations of current approaches, as well as the computational aspects that can be further improved, are also briefly discussed. The work presented here provides practical guidelines on how to perform metagenomic analysis of microbial communities characterized by varied levels of diversity and establishes approaches to handle the resulting data, independent of the sequencing platform employed.
