Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India.
Gigascience. 2018 Dec 1;7(12):giy125. doi: 10.1093/gigascience/giy125.
The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k (k-mers) in DNA/RNA sequencing reads for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Several techniques have recently been developed to count k-mers in large sequencing datasets, each making a different trade-off between the time and the memory required. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters: disk usage, accuracy, parallelism, the impact of compressed input, performance when counting large k values, and scalability to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
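To make the counting task concrete, the following is a minimal in-memory sketch, not one of the programs assessed in the study. The function name `count_kmers`, the toy reads, and the choice to skip windows containing ambiguous bases are illustrative assumptions. A dictionary-based counter like this holds every distinct k-mer in memory, which is the cost that dedicated tools trade against runtime and disk usage.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across an iterable of reads.

    Hypothetical helper for illustration only; it keeps all distinct
    k-mers in memory and is not suitable for large datasets.
    """
    counts = Counter()
    valid_bases = set("ACGT")
    for read in reads:
        seq = read.upper()
        # A read of length n contains n - k + 1 overlapping k-mers;
        # reads shorter than k contribute nothing.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip windows containing ambiguous bases such as N.
            if set(kmer) <= valid_bases:
                counts[kmer] += 1
    return counts


# Toy input: two overlapping reads (made-up data).
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Many assembly-oriented tools count canonical k-mers, treating a k-mer and its reverse complement as the same key; this sketch omits that step for brevity.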