Wang Mingjie, Ye Yuzhen, Tang Haixu
School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA.
J Comput Biol. 2012 Jun;19(6):814-25. doi: 10.1089/cmb.2012.0058.
The wide application of next-generation sequencing (NGS) technologies in metagenomics has raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes, followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely related microbial species (e.g., within the same genus) or strains (e.g., within the same species); it is therefore often difficult to determine which species or strain a specific read was sampled from when the read can be mapped at high similarity to a common region shared by multiple genomes. Furthermore, high genomic variation is observed among individual genomes within the same species, and this variation is difficult to differentiate from inter-species variation during read mapping. To address these issues, a commonly used approach is to quantify the taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the genomic regions common to them are collapsed.
Using simulated and real metagenomic datasets, we show that the de Bruijn graph approach has several advantages over existing methods: (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification of closely related species (and even of strains within the same species).
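The idea of collapsing common regions can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a simplified k-mer index in which each distinct k-mer is stored once and labeled with the genomes that contain it, and it assigns a read to a genome only when exact k-mer matches leave a single candidate. The function names (`build_graph`, `assign_read`, `count_reads`), the toy sequences, and the choice of `k = 4` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_graph(genomes, k):
    """Map each distinct k-mer to the set of genome names containing it.

    A k-mer shared by several genomes is stored only once, which is the
    sense in which common regions are collapsed in this toy model.
    """
    nodes = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            nodes[km].add(name)
    return nodes


def assign_read(read, nodes, k):
    """Return the genomes consistent with every k-mer of the read.

    An empty set means the read is unmapped (assumption: exact matches only).
    """
    candidates = None
    for km in kmers(read, k):
        owners = nodes.get(km)
        if not owners:
            return set()
        candidates = set(owners) if candidates is None else candidates & owners
        if not candidates:
            return set()
    return candidates or set()


def count_reads(reads, nodes, k):
    """Count reads assigned to exactly one genome, and reads left ambiguous."""
    unique = defaultdict(int)
    shared = 0
    for read in reads:
        candidates = assign_read(read, nodes, k)
        if len(candidates) == 1:
            unique[next(iter(candidates))] += 1
        elif len(candidates) > 1:
            shared += 1
    return dict(unique), shared


# Two hypothetical genomes that share the prefix ACGTTGC.
genomes = {"A": "ACGTTGCAAT", "B": "ACGTTGCCGA"}
nodes = build_graph(genomes, 4)
reads = ["ACGTTG", "TTGCAA", "GCCGA", "TGCCG", "GGGGGG"]
print(count_reads(reads, nodes, 4))  # ({'A': 1, 'B': 2}, 1)
```

In this toy run the read that falls entirely inside the shared region is recorded once as ambiguous instead of being mapped to both genomes, while reads that touch a genome-specific k-mer are counted for that genome alone. A real pipeline would also need to handle sequencing errors, reverse complements, and the distribution of ambiguous reads, none of which is modeled here.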