Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR.

Author information

Nawrocki Eric P

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894 USA.

Publication information

bioRxiv. 2022 Apr 27:2022.04.25.489427. doi: 10.1101/2022.04.25.489427.

DOI:10.1101/2022.04.25.489427

PMID:35547842

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9094095/

Abstract

BACKGROUND

In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package, which GenBank uses to automatically validate and annotate incoming viral sequences, is too slow and memory-intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from validating and annotating those sequences accurately.

RESULTS

VADR has been updated to annotate SARS-CoV-2 sequences more accurately and rapidly. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing. The slowest steps have been overhauled, increasing speed, reducing the memory requirement from 64 GB to 2 GB per thread, and allowing simple, coarse-grained parallelization on multiple processors per host.
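The idea of temporarily replacing N stretches can be illustrated with a minimal sketch. This is not VADR's implementation: it assumes, for simplicity, that the input sequence is already in the same coordinates as the reference (same length, no insertions or deletions), and the function names `mask_n_runs` and `restore_n_runs` and the `min_run` threshold are hypothetical names chosen for this example.

```python
import re


def mask_n_runs(seq, reference, min_run=5):
    """Replace each run of at least `min_run` Ns with the reference bases
    at the same positions.

    Simplifying assumption (not how VADR works internally): `seq` and
    `reference` have identical length and coordinates.

    Returns the patched sequence and the list of half-open (start, end)
    intervals that were replaced, so the Ns can be restored later.
    """
    if len(seq) != len(reference):
        raise ValueError("seq and reference must have the same length")
    patched = list(seq)
    replaced = []
    for match in re.finditer(r"N{%d,}" % min_run, seq):
        start, end = match.span()
        patched[start:end] = reference[start:end]
        replaced.append((start, end))
    return "".join(patched), replaced


def restore_n_runs(patched, replaced):
    """Put the original Ns back after downstream processing has finished."""
    restored = list(patched)
    for start, end in replaced:
        restored[start:end] = "N" * (end - start)
    return "".join(restored)


reference = "ACGTACGTACGTACGT"
seq = "ACGTNNNNNNGTACNT"  # one run of six Ns, plus one isolated N

patched, replaced = mask_n_runs(seq, reference)
print(patched)   # ACGTACGTACGTACNT (the isolated N is below min_run and is kept)
print(replaced)  # [(4, 10)]
print(restore_n_runs(patched, replaced) == seq)  # True
```

Recording the replaced intervals is what makes the substitution temporary: downstream steps see a sequence without long N stretches, and the original Ns are put back before the result is reported.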

CONCLUSION

VADR is now nearly 1000 times faster than it was in early 2020 for processing SARS-CoV-2 sequences submitted to GenBank. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month. Version 1.4.1 is freely available (https://github.com/ncbi/vadr) for local installation and use.
