Computer Science Division, University of California-Berkeley, CA 94721, USA.
Genome Res. 2011 Jul;21(7):1181-92. doi: 10.1101/gr.111351.110. Epub 2011 Apr 11.
Developing accurate, scalable algorithms to improve data quality is an important computational challenge associated with recent advances in high-throughput sequencing technology. In this study, a novel error-correction algorithm, called ECHO, is introduced for correcting base-call errors in short reads, without the need for a reference genome. Unlike most previous methods, ECHO does not require the user to specify parameters whose optimal values are typically unknown a priori. ECHO automatically sets the parameters in the assumed model and estimates error characteristics specific to each sequencing run, while maintaining a running time that is within the range of practical use. ECHO is based on a probabilistic model and is able to assign a quality score to each corrected base. Furthermore, it explicitly models heterozygosity in diploid genomes and provides a reference-free method for detecting bases that originated from heterozygous sites. On both real and simulated data, ECHO is able to improve on the accuracy of previous error-correction methods by severalfold to an order of magnitude, depending on the sequence coverage depth and the position in the read. The improvement is most pronounced toward the end of the read, where previous methods become noticeably less effective. Using a whole-genome yeast data set, it is demonstrated here that ECHO is capable of coping with nonuniform coverage. Also, it is shown that using ECHO to perform error correction as a preprocessing step considerably facilitates de novo assembly, particularly in the case of low-to-moderate sequence coverage depth.
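As a rough illustration of what assigning a quality score to a corrected base from a probabilistic model can look like, the sketch below calls the most probable base in one column of overlapping reads and converts the posterior error probability into a Phred-scaled score. It is not ECHO's actual model: the function name `call_base`, the fixed `error_rate`, the uniform prior, the independence assumption, and the `max_quality` cap are all hypothetical simplifications, and the haploid setting ignores the heterozygosity modeling described in the abstract.

```python
import math

BASES = "ACGT"


def call_base(observed, error_rate=0.01, max_quality=60):
    """Return (most probable base, Phred-scaled quality) for one read column.

    Hypothetical simplification: every observed base is wrong independently
    with probability `error_rate`, errors are spread evenly over the three
    other bases, and the prior over the true base is uniform.
    """
    # Log-likelihood of the observations under each candidate true base.
    log_like = {}
    for candidate in BASES:
        total = 0.0
        for base in observed:
            if base == candidate:
                total += math.log(1.0 - error_rate)
            else:
                total += math.log(error_rate / 3.0)
        log_like[candidate] = total

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_like.values())
    weights = {b: math.exp(ll - peak) for b, ll in log_like.items()}
    norm = sum(weights.values())

    # Ties are broken by the order of BASES.
    best = max(BASES, key=lambda b: weights[b])

    # Sum the competing posteriors directly rather than computing
    # 1 - posterior(best), which loses precision when the call is confident.
    p_wrong = sum(w for b, w in weights.items() if b != best) / norm

    if p_wrong <= 0.0:
        quality = max_quality
    else:
        quality = min(max_quality, int(round(-10.0 * math.log10(p_wrong))))
    return best, quality
```

For example, `call_base("AAAAC")` treats the lone C as a likely sequencing error and calls A with a high score, whereas an evenly split column such as `call_base("AC")` yields a very low score, which is the kind of ambiguity a diploid-aware model would need to distinguish from a heterozygous site.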