Chouvarine Philippe, Wiehlmann Lutz, Moran Losada Patricia, DeLuca David S, Tümmler Burkhard
Department of Pediatrics, Baylor College of Medicine, Houston, Texas 77030, United States of America.
Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Hannover D-30625, Germany.
PLoS One. 2016 Oct 19;11(10):e0165015. doi: 10.1371/journal.pone.0165015. eCollection 2016.
The ever-increasing affordability of next-generation sequencing makes whole-metagenome sequencing an attractive alternative to traditional 16S rDNA, RFLP, or culturing approaches for the analysis of microbiome samples. The advantage of whole-metagenome sequencing is that it allows direct inference of the metabolic capacity and physiological features of the studied metagenome without relying on knowledge of the genotypes and phenotypes of the members of the bacterial community. It also makes it possible to overcome problems of 16S rDNA sequencing, such as the unknown copy number of the 16S gene and insufficient sequence similarity of the "universal" 16S primers to some of the target 16S genes. On the other hand, next-generation sequencing suffers from biases that result in non-uniform coverage of the sequenced genomes. To overcome this difficulty, we present a model of GC-bias in sequencing metagenomic samples, as well as the filtration and normalization techniques necessary for accurate quantification of microbial organisms. While there has been substantial research on normalization and filtration of read-count data in such techniques as RNA-seq or ChIP-seq, to our knowledge this has not been the case for whole-metagenome shotgun sequencing. The presented methods assume that complete genome references are available for most microorganisms of interest present in metagenomic samples. This is often a valid assumption in fields such as medical diagnostics of patient microbiota. Testing the model on two validation datasets showed a four-fold reduction in root-mean-square error compared to non-normalized data in both cases. The presented methods can be applied to any pipeline for whole-metagenome sequencing analysis that relies on complete microbial genome references. We demonstrate that such pre-processing reduces the number of false positive hits and increases the accuracy of abundance estimates.
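As a rough illustration of what GC-dependent normalization of read counts can look like, the following Python sketch rescales per-window read counts so that every GC-content bin has the same median count, and computes the root-mean-square error used as the evaluation metric. It is not the model presented in the paper, whose details are not given in the abstract; it substitutes a generic median-per-GC-bin rescaling. The function names (`gc_fraction`, `gc_normalize`, `rmse`), the window representation, and the choice of ten bins are all assumptions made for this example.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


def gc_normalize(windows, n_bins=10):
    """Rescale read counts so each GC bin has the same median count.

    windows: list of (gc_fraction, read_count) pairs, one per genome window.
    Returns the corrected counts in the same order.
    NOTE: a generic illustration, not the GC-bias model from the paper.
    """
    overall = median(count for _, count in windows)

    # Group the read counts by GC-content bin.
    bins = {}
    for gc, count in windows:
        b = min(int(gc * n_bins), n_bins - 1)
        bins.setdefault(b, []).append(count)
    bin_median = {b: median(counts) for b, counts in bins.items()}

    # Scale each window by (overall median / median of its GC bin).
    corrected = []
    for gc, count in windows:
        b = min(int(gc * n_bins), n_bins - 1)
        m = bin_median[b]
        corrected.append(count * overall / m if m > 0 else float(count))
    return corrected


def rmse(estimated, expected):
    """Root-mean-square error between two equally long sequences of numbers."""
    squared = [(e - x) ** 2 for e, x in zip(estimated, expected)]
    return (sum(squared) / len(squared)) ** 0.5
```

With toy windows in which mid-GC regions receive twice the reads of low- and high-GC regions, `gc_normalize` brings all windows to the same corrected count, so a comparison against a uniform expected coverage via `rmse` yields a lower error after normalization than before.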