Kryukov Kirill, Jin Lihua, Nakagawa So
Department of Informatics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan.
Genomus Co., Ltd., Sagamihara, Kanagawa 252-0226, Japan.
Patterns (N Y). 2022 Sep 9;3(9):100562. doi: 10.1016/j.patter.2022.100562. Epub 2022 Jul 7.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome data are essential for epidemiology, vaccine development, and tracking emerging variants. Millions of SARS-CoV-2 genomes have been sequenced during the pandemic. However, downloading SARS-CoV-2 genomes from databases is slow and unreliable, largely because of a suboptimal choice of compression method. We evaluated the available compressors and found that Nucleotide Archival Format (NAF) would provide a drastic improvement over current methods. For the pre-compressed datasets of the Global Initiative on Sharing Avian Flu Data (GISAID), NAF would increase efficiency 52.2-fold relative to gzip-compressed data and 3.7-fold relative to xz-compressed data. For the DNA DataBank of Japan (DDBJ), NAF would improve throughput 40-fold relative to gzip-compressed data. For GenBank and the European Nucleotide Archive (ENA), NAF would accelerate data distribution by a factor of 29.3 compared with uncompressed FASTA. This article provides a tutorial for installing and using NAF. Offering a NAF download option in sequence databases would save substantial time, bandwidth, and disk space and would accelerate biological and medical research worldwide.