Biodiversity Research Center, Academia Sinica, Taipei, Taiwan.
BMC Bioinformatics. 2010 Nov 18;11:565. doi: 10.1186/1471-2105-11-565.
Investigation of metagenomes provides greater insight into uncultured microbial communities. Improvements in sequencing technology, which yield large amounts of sequence data, have led to major breakthroughs in the field. However, current taxonomic binning tools for metagenomes discard 30-40% of Sanger sequencing data because of the stringency of their BLAST cut-offs. To provide a more comprehensive overview of metagenomic data, we re-analyzed the discarded sequences using less stringent cut-offs. Additionally, we introduced a new criterion, namely, the evolutionary conservation of adjacency between neighboring genes. To evaluate the feasibility of our approach, we re-analyzed discarded contigs and singletons from several environments with different levels of complexity. We also assessed the consistency between our taxonomic binning and the binning reported in the original studies.
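As an illustration of the adjacency criterion described above, the following is a minimal sketch and not the pipeline used in the study. It assumes that a fragment carries two neighboring genes, that each gene has a list of BLAST hits annotated with a reference genome and the ordinal position of the hit gene on that genome, and that a taxon is accepted only when both genes hit neighboring genes on the same genome under a relaxed E-value cut-off. The names `Hit` and `assign_by_adjacency` and the thresholds `relaxed_evalue` and `max_gap` are hypothetical and are not taken from the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    """One BLAST hit for a query gene (hypothetical record layout)."""
    genome: str      # reference genome identifier
    gene_index: int  # ordinal position of the hit gene along that genome
    evalue: float    # BLAST E-value of the hit


def assign_by_adjacency(
    hits_a: Iterable[Hit],
    hits_b: Iterable[Hit],
    relaxed_evalue: float = 1e-2,  # assumed relaxed cut-off, not from the paper
    max_gap: int = 1,              # assumed: hit genes must be direct neighbors
) -> Optional[str]:
    """Return the genome supported by a conserved adjacent hit pair, or None.

    hits_a and hits_b are the hits of two neighboring genes on one fragment.
    """
    hits_b = list(hits_b)
    best = None  # (worst E-value of the pair, genome)
    for a in hits_a:
        if a.evalue > relaxed_evalue:
            continue
        for b in hits_b:
            if b.evalue > relaxed_evalue:
                continue
            same_genome = a.genome == b.genome
            adjacent = 0 < abs(a.gene_index - b.gene_index) <= max_gap
            if same_genome and adjacent:
                pair_score = max(a.evalue, b.evalue)
                if best is None or pair_score < best[0]:
                    best = (pair_score, a.genome)
    return None if best is None else best[1]


# Made-up hits: only genomeX has both genes next to each other.
gene1_hits = [Hit("genomeX", 10, 1e-3), Hit("genomeY", 5, 1e-4)]
gene2_hits = [Hit("genomeX", 11, 5e-3), Hit("genomeY", 40, 1e-5)]
print(assign_by_adjacency(gene1_hits, gene2_hits))
```

In this toy input the hits to `genomeY` have better E-values, but only the `genomeX` hits are adjacent, so adjacency rather than hit strength decides the assignment.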
Among the discarded data, we found that 23.7 ± 3.9% of singletons and 14.1 ± 1.0% of contigs could be assigned to taxa; the recovery rate was thus higher for singletons than for contigs. The Pearson correlation coefficient revealed a high degree of similarity (0.94 ± 0.03 at the phylum rank and 0.80 ± 0.11 at the family rank) between the taxonomic binning produced by the proposed approach and that reported in the original studies. In addition, an evaluation using simulated data demonstrated the reliability of the proposed approach.
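The consistency check reported above can be sketched as a Pearson correlation between two taxon abundance profiles at a given rank. This is a minimal sketch under stated assumptions: each profile is taken to be a mapping from taxon name to assigned sequence count, taxa missing from one profile are counted as zero, and the function names `pearson` and `profile_correlation` as well as the example counts are hypothetical.

```python
import math
from typing import Mapping, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def profile_correlation(
    original: Mapping[str, float], recovered: Mapping[str, float]
) -> float:
    """Correlate two taxon profiles over the union of their taxa.

    Assumption: a taxon absent from one profile has abundance 0 there.
    """
    taxa = sorted(set(original) | set(recovered))
    x = [original.get(t, 0.0) for t in taxa]
    y = [recovered.get(t, 0.0) for t in taxa]
    return pearson(x, y)


# Made-up phylum-level counts, for illustration only.
original_bins = {"Proteobacteria": 520, "Bacteroidetes": 210, "Firmicutes": 90}
recovered_bins = {"Proteobacteria": 130, "Bacteroidetes": 60, "Actinobacteria": 15}
r = profile_correlation(original_bins, recovered_bins)
print(f"phylum-rank Pearson r = {r:.2f}")
```

Because the Pearson coefficient is insensitive to overall scale, the smaller number of recovered sequences does not by itself lower the correlation; only differences in relative composition do.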
Our findings suggest that taking the conserved adjacency of neighboring genes into account improves taxonomic assignment when analyzing Sanger-sequenced metagenomes. In other words, using conserved gene order as a criterion will reduce the amount of data discarded when analyzing metagenomes.