Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, 20892, MD, USA.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA.
BMC Bioinformatics. 2020 May 24;21(1):211. doi: 10.1186/s12859-020-3537-3.
GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions.
We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system analyzes each input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models classify each sequence by determining the RefSeq it is most similar to, and feature annotation from that RefSeq is mapped onto the sequence based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of "alerts" that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank's submission processing pipeline, so that viral submissions passing all tests are accepted and annotated automatically, without intervention by a human GenBank indexer. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally.
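The alert-based acceptance logic described above can be illustrated with a minimal sketch. It is not VADR's implementation: the two checks, the alert codes, the thresholds, and the names `Alert`, `Result`, and `validate` are all hypothetical stand-ins for the real classification, alignment, and protein-validation stages. It also assumes, as a simplification, that any alert blocks automatic acceptance. The point is only the shape of the design: every stage is a deterministic function that returns zero or more alerts, and a sequence passes when none are raised.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Alert:
    code: str    # hypothetical identifier, not one of VADR's 43 alert types
    detail: str  # human-readable explanation for the submitter


@dataclass
class Result:
    name: str
    alerts: List[Alert] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Simplifying assumption: any alert blocks automatic acceptance.
        return not self.alerts


# A check is a deterministic function from a sequence to a list of alerts.
Check = Callable[[str], List[Alert]]


def check_min_length(seq: str, min_len: int = 20) -> List[Alert]:
    """Toy stand-in for a stage that rejects very short sequences."""
    if len(seq) < min_len:
        return [Alert("too_short", f"length {len(seq)} < {min_len}")]
    return []


def check_ambiguous(seq: str, max_fraction: float = 0.1) -> List[Alert]:
    """Toy stand-in for a stage that flags many non-ACGT characters."""
    if not seq:
        return []
    n_ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
    fraction = n_ambiguous / len(seq)
    if fraction > max_fraction:
        return [Alert("too_ambiguous", f"{fraction:.0%} non-ACGT characters")]
    return []


def validate(name: str, seq: str, checks: List[Check]) -> Result:
    """Run every check in order and collect all alerts for one sequence."""
    result = Result(name)
    for check in checks:
        result.alerts.extend(check(seq))
    return result


if __name__ == "__main__":
    checks = [check_min_length, check_ambiguous]
    for name, seq in [("seq1", "ACGT" * 10), ("seq2", "ACGTNNNN")]:
        res = validate(name, seq, checks)
        status = "PASS" if res.passed else "FAIL"
        print(name, status, [a.code for a in res.alerts])
```

Because each check depends only on the input sequence and fixed thresholds, running the sketch twice on the same input reports the same alerts, which is the property the abstract contrasts with the earlier BLAST-based system.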
VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.