A benchmark study of k-mer counting methods for high-throughput sequencing.

Affiliation

Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India.

Publication information

Gigascience. 2018 Dec 1;7(12):giy125. doi: 10.1093/gigascience/giy125.

DOI:10.1093/gigascience/giy125

PMID:30346548

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6280066/

Abstract

The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values and the scalability of the application to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
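
To make the counting task concrete, the following is a minimal sketch of naive k-mer counting with an in-memory hash table. It is not the method of any program assessed in the study; the function name `count_kmers`, the toy reads, and the choice to skip windows containing ambiguous bases are assumptions made for illustration only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across an iterable of reads.

    Illustrative only: keeps all distinct k-mers in one in-memory table.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    valid_bases = {"A", "C", "G", "T"}
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip windows containing ambiguous bases such as N.
            if set(kmer) <= valid_bases:
                counts[kmer] += 1
    return counts


# Hypothetical toy input, not data from the study.
reads = ["ACGTAC", "CGTNAC"]
counts = count_kmers(reads, 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

In a table like this, memory grows with the number of distinct k-mers, which is one way to picture the time and memory trade-off the abstract refers to when datasets reach hundreds of gigabytes.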

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fef2/6280066/91016a7da73d/giy125fig1.jpg

Similar articles

A benchmark study of k-mer counting methods for high-throughput sequencing.

Gigascience. 2018 Dec 1;7(12):giy125. doi: 10.1093/gigascience/giy125.

DSK: k-mer counting with very low memory usage.

Bioinformatics. 2013 Mar 1;29(5):652-3. doi: 10.1093/bioinformatics/btt020. Epub 2013 Jan 16.

Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems.

IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1117-1131. doi: 10.1109/TCBB.2017.2760829. Epub 2017 Oct 9.

A general near-exact k-mer counting method with low memory consumption enables de novo assembly of 106× human sequence data in 2.7 hours.

Bioinformatics. 2020 Dec 30;36(Suppl_2):i625-i633. doi: 10.1093/bioinformatics/btaa890.

Squeakr: an exact and approximate k-mer counting system.

Bioinformatics. 2018 Feb 15;34(4):568-575. doi: 10.1093/bioinformatics/btx636.

KMC 2: fast and resource-frugal k-mer counting.

Bioinformatics. 2015 May 15;31(10):1569-76. doi: 10.1093/bioinformatics/btv022. Epub 2015 Jan 20.

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Bioinformatics. 2011 Mar 15;27(6):764-70. doi: 10.1093/bioinformatics/btr011. Epub 2011 Jan 7.

KCMBT: a k-mer Counter based on Multiple Burst Trees.

Bioinformatics. 2016 Sep 15;32(18):2783-90. doi: 10.1093/bioinformatics/btw345. Epub 2016 Jun 9.

Efficient counting of k-mers in DNA sequences using a bloom filter.

BMC Bioinformatics. 2011 Aug 10;12:333. doi: 10.1186/1471-2105-12-333.

HISEA: HIerarchical SEed Aligner for PacBio data.

BMC Bioinformatics. 2017 Dec 19;18(1):564. doi: 10.1186/s12859-017-1953-9.

Cited by

Poplar: a phylogenomics pipeline.

Bioinform Adv. 2025 May 6;5(1):vbaf104. doi: 10.1093/bioadv/vbaf104. eCollection 2025.

The genomes of the most diverse AA genome rice species provide a resource for rice improvement and studies of rice evolution and domestication.

BMC Genomics. 2025 Jan 21;26(1):54. doi: 10.1186/s12864-025-11246-0.

Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method.

Interdiscip Sci. 2024 Oct 21. doi: 10.1007/s12539-024-00659-2.

The genomes of Australian wild limes.

Plant Mol Biol. 2024 Sep 24;114(5):102. doi: 10.1007/s11103-024-01502-4.

A survey of k-mer methods and applications in bioinformatics.

Comput Struct Biotechnol J. 2024 May 21;23:2289-2303. doi: 10.1016/j.csbj.2024.05.025. eCollection 2024 Dec.

The genome of Citrus australasica reveals disease resistance and other species specific genes.

BMC Plant Biol. 2024 Apr 10;24(1):260. doi: 10.1186/s12870-024-04988-8.

Space-efficient computation of k-mer dictionaries for large values of k.

Algorithms Mol Biol. 2024 Apr 5;19(1):14. doi: 10.1186/s13015-024-00259-1.

Chemical unclonable functions based on operable random DNA pools.

Nat Commun. 2024 Apr 5;15(1):2955. doi: 10.1038/s41467-024-47187-7.

Short-time AOIs-based representative scanpath identification and scanpath aggregation.

Behav Res Methods. 2024 Sep;56(6):6051-6066. doi: 10.3758/s13428-023-02332-w. Epub 2024 Jan 9.

APIPred: An XGBoost-Based Method for Predicting Aptamer-Protein Interactions.

J Chem Inf Model. 2024 Apr 8;64(7):2290-2301. doi: 10.1021/acs.jcim.3c00713. Epub 2023 Dec 21.

References

Squeakr: an exact and approximate k-mer counting system.

Bioinformatics. 2018 Feb 15;34(4):568-575. doi: 10.1093/bioinformatics/btx636.

FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads.

Sci Rep. 2017 May 31;7(1):2537. doi: 10.1038/s41598-017-02487-5.

KMC 3: counting and manipulating k-mer statistics.

Bioinformatics. 2017 Sep 1;33(17):2759-2761. doi: 10.1093/bioinformatics/btx304.

ntCard: a streaming algorithm for cardinality estimation in genomics data.

Bioinformatics. 2017 May 1;33(9):1324-1330. doi: 10.1093/bioinformatics/btw832.

Gerbil: a fast and memory-efficient k-mer counter with GPU-support.

Algorithms Mol Biol. 2017 Mar 31;12:9. doi: 10.1186/s13015-017-0097-9. eCollection 2017.

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies.

Bioinformatics. 2017 Feb 15;33(4):574-576. doi: 10.1093/bioinformatics/btw663.

KCMBT: a k-mer Counter based on Multiple Burst Trees.

Bioinformatics. 2016 Sep 15;32(18):2783-90. doi: 10.1093/bioinformatics/btw345. Epub 2016 Jun 9.

Computational Performance Assessment of k-mer Counting Algorithms.

J Comput Biol. 2016 Apr;23(4):248-55. doi: 10.1089/cmb.2015.0199. Epub 2016 Mar 16.

Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly.

Brief Bioinform. 2017 Jan;18(1):1-8. doi: 10.1093/bib/bbw003. Epub 2016 Feb 10.

GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists.

Gigascience. 2015 Dec 3;4:58. doi: 10.1186/s13742-015-0097-y. eCollection 2015.
