EDAR: an efficient error detection and removal algorithm for next generation sequencing data.

Authors

Zhao Xiaohong, Palmer Lance E, Bolanos Randall, Mircean Cristian, Fasulo Dan, Wittenberg Gayle M

Affiliation

Siemens Corporate Research, Princeton, New Jersey, USA.

Publication information

J Comput Biol. 2010 Nov;17(11):1549-60. doi: 10.1089/cmb.2010.0127. Epub 2010 Oct 25.

Abstract

Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is computed. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After performing error removal, the error rate for all data sets tested decreased (∼35-fold reduction, on average). EDAR has comparable accuracy to methods that correct rather than remove errors, and when the error rate is greater than 3% for simulated data sets, it performs better. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.
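
The k-count and read-splitting ideas in the abstract can be illustrated with a minimal Python sketch. This is not the EDAR implementation: a fixed k-count threshold stands in for the variable-bandwidth mean-shift clustering, and the split rule is a simple assumption rather than the paper's heuristic. The function names and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring across the whole read set (the k-count)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_at_low_count(read, counts, k, min_count):
    """Return the maximal stretches of `read` covered by trusted k-mers.

    ASSUMPTION: a k-mer is "trusted" when its k-count is >= min_count.
    EDAR itself clusters k-counts with variable-bandwidth mean shift;
    the fixed threshold here is only a simplified stand-in.
    """
    fragments = []
    start = None  # index of the first k-mer in the current trusted run
    for i in range(len(read) - k + 1):
        trusted = counts[read[i:i + k]] >= min_count
        if trusted and start is None:
            start = i
        elif not trusted and start is not None:
            # The last trusted k-mer began at i - 1 and ends at i - 1 + k.
            fragments.append(read[start:i - 1 + k])
            start = None
    if start is not None:
        fragments.append(read[start:])
    return fragments


# Five error-free copies plus one read with a single substitution (C -> G).
reads = ["ACGTACGTAC"] * 5 + ["ACGTAGGTAC"]
counts = kmer_counts(reads, k=4)
fragments = split_at_low_count("ACGTAGGTAC", counts, k=4, min_count=2)
```

In this toy data set the four k-mers that overlap the substituted base each occur once, so the read is split around that base into two shorter fragments, while an error-free read is returned whole. It also shows why splitting is costly for short reads: each fragment is much shorter than the original read.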

