MEGAN analysis of metagenomic data.

Authors

Huson Daniel H, Auch Alexander F, Qi Ji, Schuster Stephan C

Affiliation

Center for Bioinformatics, Tübingen University, Sand 14, 72076 Tübingen, Germany.

Publication

Genome Res. 2007 Mar;17(3):377-86. doi: 10.1101/gr.5969107. Epub 2007 Jan 25.

DOI:10.1101/gr.5969107

PMID:17255551

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1800929/

Abstract

Metagenomics is the study of the genomic content of a sample of organisms obtained from a common habitat using targeted or random sequencing. Goals include understanding the extent and role of microbial diversity. The taxonomical content of such a sample is usually estimated by comparison against sequence databases of known sequences. Most published studies use the analysis of paired-end reads, complete sequences of environmental fosmid and BAC clones, or environmental assemblies. Emerging sequencing-by-synthesis technologies with very high throughput are paving the way to low-cost random "shotgun" approaches. This paper introduces MEGAN, a new computer program that allows laptop analysis of large metagenomic data sets. In a preprocessing step, the set of DNA sequences is compared against databases of known sequences using BLAST or another comparison tool. MEGAN is then used to compute and explore the taxonomical content of the data set, employing the NCBI taxonomy to summarize and order the results. A simple lowest common ancestor algorithm assigns reads to taxa such that the taxonomical level of the assigned taxon reflects the level of conservation of the sequence. The software allows large data sets to be dissected without the need for assembly or the targeting of specific phylogenetic markers. It provides graphical and statistical output for comparing different data sets. The approach is applied to several data sets, including the Sargasso Sea data set, a recently published metagenomic data set sampled from a mammoth bone, and several complete microbial genomes. Also, simulations that evaluate the performance of the approach for different read lengths are presented.
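The lowest common ancestor assignment described in the abstract can be illustrated with a minimal sketch. This is not MEGAN's code: the parent map below is a hand-written, heavily simplified toy taxonomy rather than data read from the NCBI taxonomy, and the names `PARENT`, `path_to_root`, `lowest_common_ancestor`, and `assign_reads` are hypothetical. It assumes each read arrives with the list of taxa that its database hits belong to, and it applies no score filtering.

```python
# Toy taxonomy as a child -> parent map (illustrative only, not the NCBI taxonomy).
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Escherichia coli": "Gammaproteobacteria",
    "Salmonella enterica": "Gammaproteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def path_to_root(taxon, parent=PARENT):
    """Return the list of nodes from `taxon` up to the root, leaf first."""
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxa, parent=PARENT):
    """Return the deepest node shared by the root paths of all `taxa`."""
    taxa = list(taxa)
    if not taxa:
        return None  # a read with no hits stays unassigned
    first = path_to_root(taxa[0], parent)
    common = set(first)
    for taxon in taxa[1:]:
        common &= set(path_to_root(taxon, parent))
    # `first` is ordered leaf -> root, so the first shared node is the lowest one.
    for node in first:
        if node in common:
            return node
    return None


def assign_reads(hits_by_read, parent=PARENT):
    """Map each read to the LCA of the taxa hit by that read."""
    return {
        read: lowest_common_ancestor(taxa, parent)
        for read, taxa in hits_by_read.items()
    }


# Hypothetical hits: read id -> taxa of the database sequences it matched.
hits = {
    "read1": ["Escherichia coli"],
    "read2": ["Escherichia coli", "Salmonella enterica"],
    "read3": ["Escherichia coli", "Bacillus subtilis"],
    "read4": [],
}
assignments = assign_reads(hits)
for read, taxon in assignments.items():
    print(read, "->", taxon)
```

In this toy run, a read that matches only one species is assigned to that species, while a read whose sequence is conserved across distant groups is pushed up to a higher node such as Bacteria. That is the sense in which the level of the assigned taxon reflects the level of conservation of the sequence.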
