Department of Electrical and Computer Engineering, Iowa State University, Ames IA 50011, USA.
Bioinformatics. 2010 Oct 15;26(20):2526-33. doi: 10.1093/bioinformatics/btq468. Epub 2010 Aug 16.
MOTIVATION: Error correction is critical to the success of next-generation sequencing applications, such as resequencing and de novo genome sequencing. It is especially important for high-throughput short-read sequencing, where reads are much shorter and more abundant, and errors more frequent, than in traditional Sanger sequencing. Processing massive numbers of short reads with existing error correction methods is both compute- and memory-intensive, yet the results are far from satisfactory when applied to real datasets.
RESULTS: We present a novel approach, termed Reptile, for error correction in short-read data from next-generation sequencing. Reptile works with the spectrum of k-mers from the input reads, and corrects errors by simultaneously examining: (i) Hamming distance-based correction possibilities for potentially erroneous k-mers; and (ii) neighboring k-mers from the same read for correct contextual information. Because it does not need to store the input data, Reptile can handle data that does not fit in main memory. In addition to sequence data, Reptile can make use of available quality score information. Our experiments show that Reptile outperforms previous methods in the percentage of errors removed from the data and the accuracy in true base assignment. In addition, a significant reduction in run time and memory usage has been achieved compared with previous methods, making Reptile more practical for short-read error correction when sampling larger genomes.
AVAILABILITY: Reptile is implemented in C++ and is available through the link: http://aluru-sun.ece.iastate.edu/doku.php?id=software
CONTACT: aluru@iastate.edu.
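As a rough illustration of the k-mer spectrum idea named under RESULTS, the following Python sketch counts the k-mers in a set of reads and lists the Hamming distance-1 neighbors of a rare k-mer that are themselves frequent. It is not Reptile's implementation: the function names (`kmer_spectrum`, `hamming_neighbors`, `correction_candidates`), the `min_count` threshold, the distance-1 limit, and the toy reads are all assumptions made for this example, and the use of neighboring k-mers from the same read and of quality scores is not modeled.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer that occurs in the reads (the k-mer spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming_neighbors(kmer, alphabet="ACGT"):
    """Yield every k-mer at Hamming distance exactly 1 from kmer."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def correction_candidates(kmer, spectrum, min_count):
    """Return distance-1 neighbors seen at least min_count times,
    most frequent first. min_count is a hypothetical threshold."""
    found = [
        (spectrum[n], n)
        for n in hamming_neighbors(kmer)
        if spectrum[n] >= min_count
    ]
    return [n for _, n in sorted(found, reverse=True)]


# Toy data: five identical reads plus one read with a single substitution.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
spectrum = kmer_spectrum(reads, k=4)

# "GTTC" occurs once; its only frequent distance-1 neighbor is "GTAC".
print(correction_candidates("GTTC", spectrum, min_count=3))  # ['GTAC']
```

In this toy case the rare k-mer has a single well-supported neighbor. When several candidates pass the threshold, additional evidence is needed to choose among them, which is the role the abstract assigns to contextual information from neighboring k-mers in the same read.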