EDAR: an efficient error detection and removal algorithm for next generation sequencing data.

Authors

Zhao Xiaohong, Palmer Lance E, Bolanos Randall, Mircean Cristian, Fasulo Dan, Wittenberg Gayle M

Affiliation

Siemens Corporate Research, Princeton, New Jersey, USA.

Publication information

J Comput Biol. 2010 Nov;17(11):1549-60. doi: 10.1089/cmb.2010.0127. Epub 2010 Oct 25.

DOI: 10.1089/cmb.2010.0127

PMID: 20973743

Abstract

Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is calculated. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After performing error removal, the error rate for all data sets tested decreased (∼35-fold reduction, on average). EDAR has comparable accuracy to methods that correct rather than remove errors, and it performs better when the error rate of simulated data sets is greater than 3%. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.
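The k-count idea described in the abstract can be illustrated with a minimal sketch. This is not the EDAR implementation: the function names and the `min_count` parameter are hypothetical, and a fixed k-count threshold stands in for both the variable-bandwidth mean-shift clustering and the heuristic error-localization step, neither of which the abstract specifies in enough detail to reproduce. The sketch only shows how low k-counts can mark a region of a read and how removing that region splits the read into shorter fragments.

```python
from collections import Counter


def count_kmers(reads, k):
    """Build the k-count table: occurrences of every length-k substring in the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kcount_profile(read, counts, k):
    """Return the k-count of each k-mer in the read, in positional order."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]


def split_at_low_count(read, counts, k, min_count):
    """Split a read into the maximal substrings covered only by trusted k-mers.

    A k-mer is treated as trusted when its k-count is >= min_count.
    ASSUMPTION: this fixed threshold replaces the paper's mean-shift
    clustering and its heuristic for locating errors.
    """
    profile = kcount_profile(read, counts, k)
    fragments = []
    start = None  # index of the first k-mer in the current trusted run
    for i, c in enumerate(profile):
        if c >= min_count:
            if start is None:
                start = i
        elif start is not None:
            # The last trusted k-mer starts at i - 1 and ends at i - 1 + k (exclusive).
            fragments.append(read[start:i - 1 + k])
            start = None
    if start is not None:
        fragments.append(read[start:len(profile) - 1 + k])
    return fragments


if __name__ == "__main__":
    # Five error-free copies of a toy read plus one copy with a substitution at index 5.
    reads = ["ACGTACGTAC"] * 5 + ["ACGTAGGTAC"]
    table = count_kmers(reads, 4)
    print(kcount_profile(reads[-1], table, 4))
    print(split_at_low_count(reads[-1], table, 4, min_count=2))
```

In this toy input, every k-mer that overlaps the substituted base occurs only once in the data set, so the read is split into two fragments that exclude that base. With real data the threshold would depend on coverage and error rate, which is the problem the per-read clustering in the abstract is meant to address.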

Similar articles

Correction of sequencing errors in a mixed set of reads.

Bioinformatics. 2010 May 15;26(10):1284-90. doi: 10.1093/bioinformatics/btq151. Epub 2010 Apr 8.

SHREC: a short-read error correction method.

Bioinformatics. 2009 Sep 1;25(17):2157-63. doi: 10.1093/bioinformatics/btp379. Epub 2009 Jun 19.

A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware.

J Comput Biol. 2010 Apr;17(4):603-15. doi: 10.1089/cmb.2009.0062.

QuorUM: An Error Corrector for Illumina Reads.

PLoS One. 2015 Jun 17;10(6):e0130821. doi: 10.1371/journal.pone.0130821. eCollection 2015.

Analysis of high-throughput sequencing data.

Methods Mol Biol. 2011;678:1-11. doi: 10.1007/978-1-60761-682-5_1.

Recount: expectation maximization based error correction tool for next generation sequencing data.

Genome Inform. 2009 Oct;23(1):189-201.

Microindel detection in short-read sequence data.

Bioinformatics. 2010 Mar 15;26(6):722-9. doi: 10.1093/bioinformatics/btq027. Epub 2010 Feb 9.

Reptile: representative tiling for short read error correction.

Bioinformatics. 2010 Oct 15;26(20):2526-33. doi: 10.1093/bioinformatics/btq468. Epub 2010 Aug 16.

[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].

Yi Chuan Xue Bao. 2004 May;31(5):431-43.

Cited by

Methods to improve the accuracy of next-generation sequencing.

Front Bioeng Biotechnol. 2023 Jan 20;11:982111. doi: 10.3389/fbioe.2023.982111. eCollection 2023.

FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads.

Sci Rep. 2017 May 31;7(1):2537. doi: 10.1038/s41598-017-02487-5.

A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis.

Hum Genomics. 2016 Jul 25;10 Suppl 2(Suppl 2):20. doi: 10.1186/s40246-016-0068-0.

Reproducibility of Illumina platform deep sequencing errors allows accurate determination of DNA barcodes in cells.

BMC Bioinformatics. 2016 Apr 2;17:151. doi: 10.1186/s12859-016-0999-4.

QuorUM: An Error Corrector for Illumina Reads.

PLoS One. 2015 Jun 17;10(6):e0130821. doi: 10.1371/journal.pone.0130821. eCollection 2015.

Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction.

Brief Bioinform. 2016 Jan;17(1):154-79. doi: 10.1093/bib/bbv029. Epub 2015 May 29.

Analysis of the evolution and structure of a complex intrahost viral population in chronic hepatitis C virus mapped by ultradeep pyrosequencing.

J Virol. 2014 Dec;88(23):13709-21. doi: 10.1128/JVI.01732-14. Epub 2014 Sep 17.

DRISEE overestimates errors in metagenomic sequencing data.

Brief Bioinform. 2014 Sep;15(5):783-7. doi: 10.1093/bib/bbt010. Epub 2013 May 22.

Using false discovery rates to benchmark SNP-callers in next-generation sequencing projects.

Sci Rep. 2013;3:1512. doi: 10.1038/srep01512.

Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data.

Front Microbiol. 2012 Sep 11;3:329. doi: 10.3389/fmicb.2012.00329. eCollection 2012.
