EDAR: an efficient error detection and removal algorithm for next generation sequencing data.
Author information
Zhao Xiaohong, Palmer Lance E, Bolanos Randall, Mircean Cristian, Fasulo Dan, Wittenberg Gayle M
Affiliation
Siemens Corporate Research, Princeton, New Jersey, USA.
Publication information
J Comput Biol. 2010 Nov;17(11):1549-60. doi: 10.1089/cmb.2010.0127. Epub 2010 Oct 25.
Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is calculated. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After performing error removal, the error rate for all data sets tested decreased (∼35-fold reduction, on average). EDAR has comparable accuracy to methods that correct rather than remove errors, and it performs better on simulated data sets when the error rate is greater than 3%. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.
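The k-count idea in the abstract can be illustrated with a minimal Python sketch. It is not the EDAR implementation: the variable-bandwidth mean-shift clustering and the heuristic error locator are replaced here by a single fixed threshold, and the names `kmer_counts`, `k_count_profile`, `split_read`, and `min_count`, as well as the toy reads, are assumptions made for illustration only.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across the whole read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def k_count_profile(read, counts, k):
    """Return the k-count of each k-mer in the read, in positional order."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]


def split_read(read, counts, k, min_count):
    """Split a read into fragments covered only by well-supported k-mers.

    Assumption: a fixed threshold (min_count) stands in for the
    mean-shift clustering and error-locating heuristic of the paper.
    """
    profile = k_count_profile(read, counts, k)
    fragments = []
    start = None  # index of the first trusted k-mer in the current run
    for i, count in enumerate(profile):
        if count >= min_count:
            if start is None:
                start = i
        elif start is not None:
            # The run of trusted k-mers ends at k-mer i - 1,
            # whose last base sits at index i - 1 + k - 1.
            fragments.append(read[start:i - 1 + k])
            start = None
    if start is not None:
        fragments.append(read[start:])
    return fragments


# Toy data: five error-free copies plus one read with a substitution
# at index 6 (C -> T). These sequences are invented for the example.
true_read = "ACGTTGCAAGCT"
bad_read = "ACGTTGTAAGCT"
reads = [true_read] * 5 + [bad_read]
counts = kmer_counts(reads, k=4)

print(k_count_profile(bad_read, counts, k=4))  # [6, 6, 6, 1, 1, 1, 1, 6, 6]
print(split_read(bad_read, counts, k=4, min_count=3))  # ['ACGTTG', 'AAGCT']
```

In this toy case the four k-mers that overlap the substituted base each occur only once, so the read is split into two fragments that exclude that base. This mirrors the removal-by-splitting behavior described above, and also shows why splitting can be costly for short reads: each fragment is much shorter than the original read.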