Department of Statistics, Northwestern University, Evanston, Illinois, United States of America.
PLoS One. 2012;7(10):e46450. doi: 10.1371/journal.pone.0046450. Epub 2012 Oct 1.
The advent of next-generation sequencing technologies has greatly advanced the field of metagenomics, which studies genetic material recovered directly from an environment. Characterizing the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. The multiple genomes contained in a metagenomic sample can be identified and quantified through homology searches of sequence reads against known sequences catalogued in reference databases. Traditionally, reads with hits to multiple genomes are assigned to non-specific or high ranks of the taxonomy tree, which compromises accurate estimation of the relative abundance of the genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree, as many existing methods do, we propose a statistical framework that models the candidate genomes to which sequence reads have hits. After the proportion of reads generated by each genome has been estimated, sequence reads are assigned to the candidate genomes and to the taxonomy tree according to an estimated probability that takes into account both the sequence alignment scores and the estimated genome abundance. The proposed method is tested on simulated datasets and on two real datasets, and it assigns reads to the low taxonomic ranks very accurately. Our statistical approach to the taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
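The two-step idea described above (first estimate the proportion of reads generated by each candidate genome, then assign each read using both its alignment evidence and those proportions) can be illustrated with a minimal sketch. This is not the TAMER implementation, which is written in R; it assumes a generic finite mixture model fitted by expectation-maximization, and the function names, the `weights` matrix, and the toy data are all hypothetical. Each entry `weights[i][j]` stands for a non-negative score-derived likelihood that read `i` came from genome `j`, with 0 meaning no hit.

```python
def posterior(weights, pi):
    """Posterior probability of each genome for each read.

    Combines alignment evidence (weights) with abundance (pi):
    post[i][j] = pi[j] * weights[i][j] / sum_k pi[k] * weights[i][k]
    """
    post = []
    for row in weights:
        joint = [p * w for p, w in zip(pi, row)]
        total = sum(joint)
        if total <= 0:
            raise ValueError("each read needs at least one candidate genome")
        post.append([x / total for x in joint])
    return post


def estimate_abundance(weights, n_iter=200, tol=1e-10):
    """Estimate genome proportions by EM (an assumed fitting procedure)."""
    n_reads = len(weights)
    n_genomes = len(weights[0])
    pi = [1.0 / n_genomes] * n_genomes  # start from uniform proportions
    for _ in range(n_iter):
        post = posterior(weights, pi)  # E-step
        # M-step: new proportion = average posterior mass per genome
        new_pi = [sum(row[j] for row in post) / n_reads
                  for j in range(n_genomes)]
        change = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if change < tol:
            break
    return pi


def assign_reads(weights, pi):
    """Assign each read to the genome with the highest posterior."""
    return [max(range(len(row)), key=row.__getitem__)
            for row in posterior(weights, pi)]


# Toy data (hypothetical): three reads hit only genome 0, one read hits
# only genome 1, and the last read hits both genomes equally well.
toy_weights = [
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
toy_pi = estimate_abundance(toy_weights)
toy_assignments = assign_reads(toy_weights, toy_pi)
```

In this toy example the ambiguous read has identical alignment evidence for both genomes, so alignment scores alone cannot place it. Because the uniquely mapped reads make genome 0 the more abundant candidate (its estimated proportion approaches 3/4), the posterior favors genome 0 for the ambiguous read rather than pushing it to a higher taxonomic rank.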