SAKE: Strobemer-assisted k-mer extraction.

Affiliation

Department of Computer Science, University of Helsinki, Helsinki, Finland.

Publication information

PLoS One. 2023 Nov 29;18(11):e0294415. doi: 10.1371/journal.pone.0294415. eCollection 2023.

DOI:10.1371/journal.pone.0294415

PMID:38019768

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10686461/

Abstract

K-mer-based analysis plays an important role in many bioinformatics applications, such as de novo assembly, sequencing error correction, and genotyping. To take full advantage of such methods, the k-mer content of a read set must be captured as accurately as possible. Often the use of long k-mers is preferred because they can be uniquely associated with a specific genomic region. Unfortunately, it is not possible to reliably extract long k-mers in high error rate reads with standard exact k-mer counting methods. We propose SAKE, a method to extract long k-mers from high error rate reads by utilizing strobemers and consensus k-mer generation through partial order alignment. Our experiments show that on simulated data with up to 6% error rate, SAKE can extract 97-mers with over 90% recall. Conversely, the recall of DSK, an exact k-mer counter, drops to less than 20%. Furthermore, the precision of SAKE remains similar to DSK. On real bacterial data, SAKE retrieves 97-mers with a recall of over 90% and slightly lower precision than DSK, while the recall of DSK already drops to 50%. We show that SAKE can extract more k-mers from uncorrected high error rate reads compared to exact k-mer counting. However, exact k-mer counters run on corrected reads can extract slightly more k-mers than SAKE run on uncorrected reads.
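The gap between exact counting and the recall figures above can be illustrated with a small sketch. The code below is not SAKE's algorithm or DSK's implementation; it is a minimal exact k-mer extractor with a recall/precision check against a reference. The toy genome and reads, the function names, and the assumption of independent per-base errors are all illustrative. Reverse complements (canonical k-mers) are ignored for simplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every exact k-mer (substring of length k) in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def recall_precision(extracted, reference):
    """Recall and precision of an extracted k-mer set against a reference set."""
    true_positives = len(extracted & reference)
    recall = true_positives / len(reference) if reference else 0.0
    precision = true_positives / len(extracted) if extracted else 0.0
    return recall, precision


def error_free_probability(k, error_rate):
    """Chance that a k-mer taken from a read contains no error,
    assuming (hypothetically) independent per-base errors."""
    return (1.0 - error_rate) ** k


# Toy data (hypothetical): the last read carries one substitution error.
genome = "ACGTTGCAAGGCTA"
reads = ["ACGTTGCA", "TTGCAAGG", "AAGGCTA", "ACGTAGCA"]
k = 5

reference = set(count_kmers([genome], k))
extracted = set(count_kmers(reads, k))
recall, precision = recall_precision(extracted, reference)
print(f"recall={recall:.3f} precision={precision:.3f}")

# For long k-mers and high error rates, most k-mer occurrences are corrupted.
print(f"P(error-free 97-mer at 6% error) = {error_free_probability(97, 0.06):.4f}")
```

In this toy case the single erroneous read adds four spurious 5-mers, which lowers precision while recall stays at 1. Under the independence assumption, a 97-mer sampled from a read with a 6% error rate is error-free only about 0.25% of the time, which is consistent with the abstract's point that exact counting struggles to recover long k-mers from such reads.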

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/671c/10686461/914e420c4a34/pone.0294415.g001.jpg
