Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
BMC Bioinformatics. 2021 Aug 12;22(1):400. doi: 10.1186/s12859-021-04316-z.
The DNA sequences encoding ribosomal RNA genes (rRNAs) are commonly used as markers to identify species, including in metagenomic samples that may combine many organismal communities. The 16S small subunit ribosomal RNA (SSU rRNA) gene is typically used to identify bacterial and archaeal species. The nuclear 18S SSU rRNA gene and the 28S large subunit (LSU) rRNA gene have been used as DNA barcodes and for phylogenetic studies in different eukaryotic taxonomic groups. Because of their popularity, the National Center for Biotechnology Information (NCBI) receives a disproportionate number of rRNA sequence submissions and BLAST queries. These sequences vary in quality, length, origin (nuclear, mitochondrial, plastid), and organism source, and can represent any region of the ribosomal cistron.
To improve the timely verification of quality, origin and locus boundaries, we developed Ribovore, a software package for the analysis of rRNA sequences. The ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal SSU rRNA. The ribodbmaker program is used to create high-quality datasets of rRNAs from different taxonomic groups. Key algorithmic steps include comparing candidate sequences against rRNA sequence profile hidden Markov models (HMMs) and covariance models of rRNA sequence and secondary-structure conservation, as well as other tests. Nine freely available blastn rRNA databases created and maintained with Ribovore are used to check incoming GenBank submissions and by the blastn browser interface at NCBI. Since 2018, Ribovore has been used to analyze more than 50 million prokaryotic SSU rRNA sequences submitted to GenBank, and to select at least 10,435 fungal rRNA RefSeq records from type material of 8,350 taxa.
Ribovore combines single-sequence and profile-based methods to improve GenBank processing and analysis of rRNA sequences. It is a standalone, portable, and extensible software package for the alignment, classification and validation of rRNA sequences. Researchers planning to submit SSU rRNA sequences to GenBank are encouraged to download Ribovore and analyze their sequences before submission to determine which sequences are likely to be accepted into GenBank automatically.