Department of Computer Science, University of Helsinki, Helsinki, Finland.
PLoS One. 2023 Nov 29;18(11):e0294415. doi: 10.1371/journal.pone.0294415. eCollection 2023.
K-mer-based analysis plays an important role in many bioinformatics applications, such as de novo assembly, sequencing error correction, and genotyping. To take full advantage of such methods, the k-mer content of a read set must be captured as accurately as possible. Long k-mers are often preferred because they can be uniquely associated with a specific genomic region. Unfortunately, standard exact k-mer counting methods cannot reliably extract long k-mers from high-error-rate reads. We propose SAKE, a method that extracts long k-mers from high-error-rate reads by using strobemers and generating consensus k-mers through partial order alignment. Our experiments show that on simulated data with an error rate of up to 6%, SAKE can extract 97-mers with over 90% recall. In contrast, the recall of DSK, an exact k-mer counter, drops below 20%, while the precision of SAKE remains similar to that of DSK. On real bacterial data, SAKE retrieves 97-mers with a recall of over 90% and slightly lower precision than DSK, whereas the recall of DSK drops to 50%. We show that SAKE can extract more k-mers from uncorrected high-error-rate reads than exact k-mer counting can. However, exact k-mer counters run on corrected reads can extract slightly more k-mers than SAKE run on uncorrected reads.
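As a rough illustration of why exact counting struggles with long k-mers, the sketch below shows a naive exact k-mer counter and the chance that a single k-mer sampled from a read contains no error. This is not the SAKE or DSK implementation. The function names `count_kmers` and `error_free_fraction` are hypothetical, and the model assumes independent, uniformly distributed per-base errors, which real sequencing data only approximates.

```python
from collections import Counter


def count_kmers(reads, k):
    """Naive exact k-mer counting: tally every length-k substring of every read."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 k-mers (none if n < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def error_free_fraction(k, error_rate):
    """Probability that all k bases of a k-mer are correct.

    Assumption: each base is wrong independently with probability error_rate.
    """
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    # Toy reads, for illustration only.
    print(count_kmers(["ACGTAC", "CGTACG"], 3).most_common(3))

    # Expected error-free fraction at a 6% per-base error rate.
    for k in (21, 51, 97):
        print(k, error_free_fraction(k, 0.06))
```

Under this simplified model, a 97-mer sampled from a read with a 6% error rate is error-free less than 1% of the time, so an exact counter sees mostly erroneous 97-mers. This is the kind of gap that consensus-based extraction is meant to close.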