Mining statistically-solid k-mers for accurate NGS error correction.

Affiliations

Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China.

School of Computing and Electronic Information, Guangxi University, Nanning, China.

Publication information

BMC Genomics. 2018 Dec 31;19(Suppl 10):912. doi: 10.1186/s12864-018-5272-y.

DOI:10.1186/s12864-018-5272-y

PMID:30598110

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6311904/

Abstract

BACKGROUND

NGS data contain many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, whereas a weak k-mer most likely does. An intensively investigated problem is to find a good frequency cutoff f to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove the small subset of solid k-mers that are likely to contain errors, and (ii) add the small subset of weak k-mers that are likely to be error-free to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.
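The solid/weak split described above can be illustrated with a minimal Python sketch. The function names, the toy reads, and the convention that a k-mer is solid when its count is at least f are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def split_by_cutoff(counts, f):
    """Return (solid, weak); a k-mer is treated as solid when count >= f (assumed convention)."""
    solid = {kmer for kmer, c in counts.items() if c >= f}
    weak = set(counts) - solid
    return solid, weak

# Toy data: three identical reads plus one read with a single substitution (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=3)
solid, weak = split_by_cutoff(counts, f=2)
```

In this toy input the three k-mers that span the substituted base occur only once and fall into the weak set, while the k-mers shared by the error-free reads are solid.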

RESULTS

We propose using a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model the frequencies of correct k-mers, and we combine the two models to determine f. To identify the two special subsets of k-mers, we use the z-score of each k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. The resulting statistically-solid k-mers are then used to construct a Bloom filter for error correction. In tests on both real and synthetic NGS data sets, our method is markedly superior to state-of-the-art methods.
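One way to picture how a Gamma model and a Gaussian mixture might be combined into a cutoff is sketched below. This is not the authors' fitting procedure: the parameters are hand-picked rather than fitted to a k-mer frequency histogram, and the rule "take the smallest frequency at which the correct-k-mer density exceeds the erroneous-k-mer density" is an assumption. The function names and `max_freq` are hypothetical.

```python
import math

def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) distribution at x > 0."""
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def gaussian_pdf(x, mean, std):
    """Density of a Normal(mean, std) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def model_cutoff(err_weight, shape, scale, components, max_freq=1000):
    """Smallest integer frequency where the weighted Gaussian mixture
    (correct k-mers) exceeds the weighted Gamma density (erroneous k-mers).
    `components` is a list of (weight, mean, std) tuples. Assumed rule."""
    for f in range(1, max_freq + 1):
        err = err_weight * gamma_pdf(f, shape, scale)
        ok = sum(w * gaussian_pdf(f, m, s) for w, m, s in components)
        if ok > err:
            return f
    return None  # no crossover found within max_freq

# Hypothetical parameters: errors pile up near frequency 1, correct k-mers
# cluster around the coverage (30) and around twice the coverage (60).
cutoff = model_cutoff(err_weight=0.3, shape=1.0, scale=1.5,
                      components=[(0.6, 30.0, 6.0), (0.1, 60.0, 9.0)])
```

In practice the distribution parameters would be estimated from the observed frequency histogram before a crossover of this kind is located.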

CONCLUSION

The z-score is adequate for distinguishing solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error-correction accuracy.
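A rough sketch of a z-score refinement follows. The abstract does not say which population supplies the mean and standard deviation, so this example assumes each k-mer is compared with its observed Hamming-distance-1 neighbors; that choice, the thresholds, the toy counts, and all function names are hypothetical.

```python
from statistics import mean, pstdev

ALPHABET = "ACGT"

def hamming_neighbors(kmer):
    """Yield every k-mer that differs from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for alt in ALPHABET:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def neighborhood_z(kmer, counts):
    """z-score of a k-mer's count within the group formed by the k-mer and
    its observed neighbors (assumed reference population)."""
    freqs = [counts[kmer]] + [counts[n] for n in hamming_neighbors(kmer) if n in counts]
    sd = pstdev(freqs)
    return 0.0 if sd == 0 else (counts[kmer] - mean(freqs)) / sd

def refine(solid, weak, counts, drop_below=-0.5, add_at_least=0.0):
    """Drop solid k-mers with a low z-score and promote weak k-mers whose
    z-score is not low. Both thresholds are illustrative only."""
    kept = {k for k in solid if neighborhood_z(k, counts) >= drop_below}
    promoted = {k for k in weak if neighborhood_z(k, counts) >= add_at_least}
    return kept | promoted

# Toy counts: ACGAA passes a cutoff of 5 but sits next to the far more
# frequent ACGTA; TTGCA is rare but has no frequent neighbor.
kmer_counts = {"ACGTA": 40, "ACGAA": 6, "ACGCA": 1, "TTGCA": 4}
solid_set = {k for k, c in kmer_counts.items() if c >= 5}
weak_set = set(kmer_counts) - solid_set
refined = refine(solid_set, weak_set, kmer_counts)
```

With these toy counts, ACGAA is removed from the solid set despite exceeding the cutoff, and the low-frequency TTGCA is added, which mirrors the two adjustments the abstract describes.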
