Biomedical Informatics Centre, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, Uttar Pradesh, India.
Biomedical Informatics Centre; Department of Gastroenterology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, Uttar Pradesh, India.
Indian J Med Res. 2020 Jan;151(1):93-103. doi: 10.4103/ijmr.IJMR_220_18.
BACKGROUND & OBJECTIVES: For bacterial community analysis, 16S rRNA sequences are taxonomically classified by comparison with one of three commonly used databases [Greengenes, SILVA and Ribosomal Database Project (RDP)]. It was hypothesized that a unified database containing fully annotated, non-redundant sequences from all three databases might provide better taxonomic classification during analysis of 16S rRNA sequence data. Hence, a unified 16S rRNA database was constructed, and its performance was assessed with four different taxonomic assignment methods and with data from various hypervariable regions (HVRs) of the 16S rRNA gene.
METHODS: We constructed a unified 16S rRNA database (16S-UDb) by merging non-ambiguous, fully annotated, full-length 16S rRNA sequences from the three databases and compared its performance in taxonomy assignment with that of the three original databases. The comparison used four taxonomy assignment methods [mothur Naïve Bayesian Classifier (mothur-nbc), RDP Naïve Bayesian Classifier (rdp-nbc), UCLUST and SortMeRNA] and data from 13 regions of 16S rRNA [seven individual HVRs (V2-V8) and six pairs of adjacent HVRs].
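The merging step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Record` type, the function names, the seven-rank lineage, the 1,200-base length cut-off and the rule of keeping the first copy of a duplicated sequence are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass

UNAMBIGUOUS_BASES = set("ACGT")
N_RANKS = 7  # assumption: lineage runs from kingdom down to species


@dataclass(frozen=True)
class Record:
    source: str      # e.g. "greengenes", "silva" or "rdp"
    sequence: str    # 16S rRNA sequence (DNA or RNA alphabet)
    taxonomy: tuple  # lineage names, highest rank first


def is_fully_annotated(taxonomy):
    """True when every one of the assumed seven ranks has a non-empty name."""
    return len(taxonomy) == N_RANKS and all(name.strip() for name in taxonomy)


def build_unified(records, min_length=1200):
    """Merge records, keeping unambiguous, fully annotated, full-length,
    non-redundant sequences. min_length is a hypothetical threshold."""
    unified = {}
    for rec in records:
        # Normalize case and RNA/DNA alphabet so identical sequences compare equal.
        seq = rec.sequence.upper().replace("U", "T")
        if len(seq) < min_length:
            continue  # not full length under the assumed cut-off
        if not set(seq) <= UNAMBIGUOUS_BASES:
            continue  # contains ambiguity codes such as N or R
        if not is_fully_annotated(rec.taxonomy):
            continue  # lineage is missing one or more ranks
        # Keep the first record seen for each distinct sequence.
        unified.setdefault(seq, rec)
    return list(unified.values())
```

Keying the dictionary on the normalized sequence is one simple way to enforce non-redundancy; how conflicting annotations for the same sequence were reconciled across the three databases is not stated in the abstract and is not modelled here.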
RESULTS: Our unified 16S rRNA database contained 13,078 full-length, fully annotated 16S rRNA sequences. With the mothur-nbc classifier and the V2+V3 region, it assigned genus and species to larger proportions of sequences in the test database (90.05 and 46.82%, respectively) than did the three original 16S rRNA databases (70.88-87.20% and 10.23-24.28%, respectively).
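The percentages reported above are proportions of test sequences that received a name at a given rank. A minimal sketch of that calculation follows, assuming each classifier result is a lineage tuple ordered from kingdom to species and that an empty string or the label "unclassified" marks a missing assignment; both conventions and the function name are hypothetical, not taken from the paper.

```python
GENUS, SPECIES = 5, 6  # assumed positions in a kingdom-to-species lineage
UNASSIGNED = {"", "unclassified"}  # assumed markers for "no assignment"


def percent_assigned(lineages, rank_index):
    """Percentage of lineages that carry a name at the given rank."""
    if not lineages:
        return 0.0
    assigned = sum(
        1
        for lineage in lineages
        if len(lineage) > rank_index
        and lineage[rank_index].strip().lower() not in UNASSIGNED
    )
    return 100.0 * assigned / len(lineages)
```

Calling `percent_assigned(results, GENUS)` and `percent_assigned(results, SPECIES)` on the output of one classifier, region and database combination would give one pair of figures comparable to those quoted above.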
INTERPRETATION & CONCLUSIONS: Our results indicate that, for analysis of bacterial mixtures, sequencing of the V2-V3 region of 16S rRNA followed by analysis of the data using the mothur-nbc classifier and our 16S-UDb database may be preferred.