Tang Deyou, Li Yucheng, Tan Daqiang, Fu Juan, Tang Yelei, Lin Jiabin, Zhao Rong, Du Hongli, Zhao Zhongming
School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China.
Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
Bioinformatics. 2022 Jan 27;38(4):933-940. doi: 10.1093/bioinformatics/btab797.
K-mer frequencies in whole-genome sequences give researchers insight into genomic complexity, comparative genomics, metagenomics and phylogeny. Current k-mer counting tools are typically slow and require large amounts of memory and disk space when analyzing assembled genomes.
We propose KCOSS, a novel and ultra-fast k-mer counting algorithm aimed mainly at assembled genomes. It combines a segmented Bloom filter, a lock-free queue, a lock-free thread pool and a cuckoo hash table. We optimize running time and memory consumption by recycling memory blocks, merging multiple consecutive first-occurrence k-mers into a C-read, and writing sets of C-reads to disk asynchronously. KCOSS was compared with Jellyfish2, CHTKC and KMC3 on seven assembled genomes and three sequencing datasets in terms of running time, memory consumption and disk usage. The experimental results show that, on assembled genomes, KCOSS counts k-mers with less memory and disk space and in a shorter running time. KCOSS can be used to calculate k-mer frequencies not only for assembled genomes but also for sequencing data.
The KCOSS software is implemented in C++. It is freely available on GitHub: https://github.com/kcoss-2021/KCOSS.
Supplementary data are available at Bioinformatics online.