A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis.

Authors

Akogwu Isaac, Wang Nan, Zhang Chaoyang, Gong Ping

Affiliations

School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA.

Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.

Publication information

Hum Genomics. 2016 Jul 25;10 Suppl 2(Suppl 2):20. doi: 10.1186/s40246-016-0068-0.

DOI:10.1186/s40246-016-0068-0

PMID:27461106

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4965716/

Abstract

BACKGROUND

Advances in high-throughput next-generation sequencing (NGS) have opened many opportunities for new genomic research. However, the abundance of NGS data makes it harder to distinguish true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study investigates the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and provides recommendations to help users select suitable methods for specific NGS datasets.
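For readers unfamiliar with the term, a k-spectrum is the collection of all length-k substrings (k-mers) found in the reads, together with their counts. The sketch below illustrates the general idea under simplified assumptions: k-mers that occur rarely are treated as likely products of sequencing errors. The function names, the toy reads, and the abundance threshold are illustrative assumptions and are not taken from any of the tools compared in this study.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (assumed likely errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data: four error-free copies of a read plus one copy with a
# single substitution (A -> T at position 4). Purely hypothetical.
reads = ["ACGTACGGA"] * 4 + ["ACGTTCGGA"]
spectrum = kmer_spectrum(reads, k=4)
suspects = weak_kmers(spectrum, min_count=2)
print(sorted(suspects))
```

In this toy case, only the four k-mers that overlap the substituted base fall below the threshold. Real correctors differ in how they store the spectrum and how they choose the threshold and the replacement base.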

METHODS

Six k-spectrum-based methods (Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel) were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method.
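The abstract does not spell out how the derived metrics are computed. The following is a minimal sketch using the definitions conventionally found in the error-correction literature, where a true positive is an erroneous base that was corrected properly, a false positive is a correct base that was wrongly changed, and a false negative is an erroneous base left uncorrected. The function name and the example counts are hypothetical, and ECET's own implementation may differ in detail.

```python
def correction_metrics(tp, fp, fn):
    """Derive recall, precision, gain, and F-score from base-level counts.

    Assumed definitions:
      recall    = TP / (TP + FN)
      precision = TP / (TP + FP)
      gain      = (TP - FP) / (TP + FN)
      F-score   = 2 * precision * recall / (precision + recall)
    """
    errors = tp + fn      # all bases that were erroneous before correction
    changes = tp + fp     # all bases counted as modified by the corrector
    recall = tp / errors if errors else 0.0
    precision = tp / changes if changes else 0.0
    gain = (tp - fp) / errors if errors else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "gain": gain, "f_score": f_score}


# Hypothetical counts, not taken from the study.
metrics = correction_metrics(tp=90, fp=10, fn=10)
print(metrics)
```

Note that gain can be negative: a corrector that introduces more errors than it removes (FP > TP) leaves the data worse than before.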

RESULTS

Results from computational experiments indicate that Musket had the best overall performance across the range of variations represented by the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), medium coverage (50×), and a small genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets.
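To give a sense of scale for such a dataset, coverage depth, read length, and genome size are related by the standard approximation coverage ≈ (number of reads × read length) / genome size. The sketch below applies it to the dataset on which Musket scored lowest, assuming that "MB" denotes megabases; the function name is hypothetical and the resulting read count is an estimate, not a figure reported by the study.

```python
import math


def reads_needed(coverage, genome_size_bp, read_length_bp):
    """Estimate the number of reads required to reach a target coverage depth."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# 50x coverage of a 5.4-megabase genome with 56 bp reads
# (treating "MB" in the abstract as megabases is an assumption).
n_reads = reads_needed(coverage=50, genome_size_bp=5_400_000, read_length_bp=56)
print(n_reads)
```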

CONCLUSIONS

This study demonstrates that factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Care must therefore be taken in choosing an appropriate method for error correction of a specific NGS dataset. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six test datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e326/4965716/178bc1ef1613/40246_2016_68_Fig1_HTML.jpg

Similar articles

Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models.

Sci Rep. 2019 Nov 6;9(1):16157. doi: 10.1038/s41598-019-52196-4.

QuorUM: An Error Corrector for Illumina Reads.

PLoS One. 2015 Jun 17;10(6):e0130821. doi: 10.1371/journal.pone.0130821. eCollection 2015.

Trowel: a fast and accurate error correction module for Illumina sequencing reads.

Bioinformatics. 2014 Nov 15;30(22):3264-5. doi: 10.1093/bioinformatics/btu513. Epub 2014 Jul 29.

Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data.

Bioinformatics. 2013 Feb 1;29(3):308-15. doi: 10.1093/bioinformatics/bts690. Epub 2012 Nov 29.

CARE 2.0: reducing false-positive sequencing error corrections using machine learning.

BMC Bioinformatics. 2022 Jun 13;23(1):227. doi: 10.1186/s12859-022-04754-3.

Evaluation of nine popular de novo assemblers in microbial genome assembly.

J Microbiol Methods. 2017 Dec;143:32-37. doi: 10.1016/j.mimet.2017.09.008. Epub 2017 Sep 19.

A hybrid and scalable error correction algorithm for indel and substitution errors of long reads.

BMC Genomics. 2019 Dec 20;20(Suppl 11):948. doi: 10.1186/s12864-019-6286-9.

HISEA: HIerarchical SEed Aligner for PacBio data.

BMC Bioinformatics. 2017 Dec 19;18(1):564. doi: 10.1186/s12859-017-1953-9.

SeqAssist: a novel toolkit for preliminary analysis of next-generation sequencing data.

BMC Bioinformatics. 2014;15 Suppl 11(Suppl 11):S10. doi: 10.1186/1471-2105-15-S11-S10. Epub 2014 Oct 21.

Cited by

HPTAS: An Alignment-Free Haplotype Phasing Algorithm Focused on Allele-Specific Studies Using Transcriptome Data.

Int J Mol Sci. 2025 Jun 13;26(12):5700. doi: 10.3390/ijms26125700.

An overlooked phenomenon: complex interactions of potential error sources on the quality of bacterial de novo genome assemblies.

BMC Genomics. 2024 Jan 9;25(1):45. doi: 10.1186/s12864-023-09910-4.

K-Mer Spectrum-Based Error Correction Algorithm for Next-Generation Sequencing Data.

Comput Intell Neurosci. 2022 Jul 14;2022:8077664. doi: 10.1155/2022/8077664. eCollection 2022.

Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing.

BMC Bioinformatics. 2022 Jan 6;23(1):25. doi: 10.1186/s12859-021-04547-0.

ntEdit: scalable genome sequence polishing.

Bioinformatics. 2019 Nov 1;35(21):4430-4432. doi: 10.1093/bioinformatics/btz400.

Molecular characterization of an unauthorized genetically modified Bacillus subtilis production strain identified in a vitamin B feed additive.

Food Chem. 2017 Sep 1;230:681-689. doi: 10.1016/j.foodchem.2017.03.042. Epub 2017 Mar 9.

References

KMC 2: fast and resource-frugal k-mer counting.

Bioinformatics. 2015 May 15;31(10):1569-76. doi: 10.1093/bioinformatics/btv022. Epub 2015 Jan 20.

Lighter: fast and memory-efficient sequencing error correction without counting.

Genome Biol. 2014;15(11):509. doi: 10.1186/s13059-014-0509-9.

Trowel: a fast and accurate error correction module for Illumina sequencing reads.

Bioinformatics. 2014 Nov 15;30(22):3264-5. doi: 10.1093/bioinformatics/btu513. Epub 2014 Jul 29.

GATB: Genome Assembly & Analysis Tool Box.

Bioinformatics. 2014 Oct 15;30(20):2959-61. doi: 10.1093/bioinformatics/btu406. Epub 2014 Jul 1.

BLESS: bloom filter-based error correction solution for high-throughput sequencing reads.

Bioinformatics. 2014 May 15;30(10):1354-62. doi: 10.1093/bioinformatics/btu030. Epub 2014 Jan 21.

Informed and automated k-mer size selection for genome assembly.

Bioinformatics. 2014 Jan 1;30(1):31-7. doi: 10.1093/bioinformatics/btt310. Epub 2013 Jun 3.

Probabilistic error correction for RNA sequencing.

Nucleic Acids Res. 2013 May 1;41(10):e109. doi: 10.1093/nar/gkt215. Epub 2013 Apr 4.

DSK: k-mer counting with very low memory usage.

Bioinformatics. 2013 Mar 1;29(5):652-3. doi: 10.1093/bioinformatics/btt020. Epub 2013 Jan 16.

Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data.

Bioinformatics. 2013 Feb 1;29(3):308-15. doi: 10.1093/bioinformatics/bts690. Epub 2012 Nov 29.

CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform.

Bioinformatics. 2012 Jul 15;28(14):1830-7. doi: 10.1093/bioinformatics/bts276. Epub 2012 May 9.
