Pockrandt Christopher, Zimin Aleksey V, Salzberg Steven L
Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA.
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.
J Open Source Softw. 2022;7(80). doi: 10.21105/joss.04908. Epub 2022 Dec 28.
Kraken and KrakenUniq are widely used tools for classifying metagenomic sequences. A key requirement for these systems is a database containing all k-mers from all genomes that the users want to be able to detect, where k = 31 by default. This database can be very large, easily exceeding 100 gigabytes (GB) and sometimes 400 GB. Previously, Kraken and KrakenUniq required loading the entire database into main memory (RAM), and if RAM was insufficient, they used memory mapping, which significantly increased the running time for large datasets. We have implemented a new algorithm in KrakenUniq that allows it to load and process the database in chunks, with only a modest increase in running time. This enhancement makes it feasible to run KrakenUniq on very large datasets and huge databases on virtually any computer, even a laptop, while providing the same very high classification accuracy as the previous system.
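The idea of processing a database in chunks can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not KrakenUniq's actual implementation: the database is modeled as an in-memory dictionary from k-mer to taxon ID, whereas a real tool would read each chunk from disk, and canonical k-mers, minimizers, and taxonomic assignment are all omitted. The names `kmers`, `database_chunks`, and `lookup_in_chunks` are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def database_chunks(
    items: List[Tuple[str, int]], chunk_size: int
) -> Iterator[Dict[str, int]]:
    """Yield the database as consecutive chunks of at most chunk_size entries.

    Assumption: in a real setting each chunk would be read from disk here,
    so that only one chunk occupies memory at a time.
    """
    for start in range(0, len(items), chunk_size):
        yield dict(items[start:start + chunk_size])


def lookup_in_chunks(
    reads: List[str], database: Dict[str, int], k: int, chunk_size: int
) -> List[Dict[str, int]]:
    """For each read, collect the taxon ID of every k-mer found in the database."""
    # Extract the distinct k-mers of each read once, up front.
    read_kmers = [set(kmers(read, k)) for read in reads]
    hits: List[Dict[str, int]] = [{} for _ in reads]
    items = sorted(database.items())
    for chunk in database_chunks(items, chunk_size):
        # Every read is checked against the chunk currently held in memory.
        for i, kmer_set in enumerate(read_kmers):
            for kmer in kmer_set:
                if kmer in chunk:
                    hits[i][kmer] = chunk[kmer]
    return hits


# Toy data with k = 3 (the default described above is k = 31).
toy_database = {"ACG": 1, "CGT": 2, "GTA": 3, "TTT": 4}
toy_reads = ["ACGT", "TTTT", "GGGG"]
toy_hits = lookup_in_chunks(toy_reads, toy_database, k=3, chunk_size=2)
```

In this sketch the hits do not depend on `chunk_size`: a smaller chunk lowers peak memory at the cost of more passes over the reads' k-mers.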
The KrakenUniq software classifies reads from metagenomic samples to establish which organisms are present in the samples and to estimate their abundance. The software is widely used by researchers and clinicians in medical diagnostics, microbiome studies, and environmental studies. Typical databases used by KrakenUniq are tens to hundreds of gigabytes in size. The original KrakenUniq code required loading the entire database into RAM, which demanded expensive high-memory servers to run it efficiently. If a user did not have enough physical RAM to load the entire database, KrakenUniq resorted to memory-mapping the database, which significantly increased run times, frequently by a factor of more than 100. The new functionality described in this paper enables users who do not have access to high-memory servers to run KrakenUniq efficiently, with CPU time increasing by a factor of only 3 to 4, down from a factor of more than 100 with memory mapping.