Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011, USA.
BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S52. doi: 10.1186/1471-2105-12-S1-S52.
High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications, including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those whose frequencies exceed a threshold. In genomes with high repeat content, an erroneous kmer may be observed frequently if it differs by only a few nucleotides from valid kmers that occur multiple times in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and the problem remains challenging for genomes with high repeat content.
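The conventional counting approach described above can be illustrated with a minimal sketch. It is not the software presented in this paper; the function names `count_kmers` and `validate_kmers`, the toy reads, and the threshold value are all assumptions chosen for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how often each k-mer is observed across all reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def validate_kmers(counts, threshold):
    """Treat a k-mer as valid if its observed frequency exceeds the threshold."""
    return {kmer for kmer, c in counts.items() if c > threshold}

# Toy data (hypothetical): the second read ends in A instead of T.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
counts = count_kmers(reads, k=4)
trusted = validate_kmers(counts, threshold=2)
```

Here the kmer `ACGA` is seen only once and is rejected. In a repeat-rich genome, the same kind of erroneous kmer could arise from many copies of a valid neighbor and pass a fixed threshold, which is the failure mode the paragraph above describes.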
We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer the genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold above which a kmer's estimated genomic frequency marks it as valid. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model the position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content.
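To make the idea of misread relationships with position-dependent errors concrete, the following is a rough sketch and not the model developed in this paper. It assumes a hypothetical per-position error rate `pos_err` within a kmer, assumes that an erroneous base is equally likely to be any of the three other nucleotides, and uses hypothetical function names and counts. It shows only the forward direction: how observed frequencies would be expected to arise from genomic ones.

```python
def misread_prob(source, observed, pos_err):
    """Probability that k-mer `source` is read as `observed`.

    Assumption: position i is misread with probability pos_err[i],
    and a misread base is uniformly one of the 3 other nucleotides.
    """
    p = 1.0
    for s, o, e in zip(source, observed, pos_err):
        p *= (1.0 - e) if s == o else e / 3.0
    return p

def expected_observed(genomic_counts, pos_err):
    """Expected observed count of each k-mer, summed over all possible sources."""
    kmers = list(genomic_counts)
    return {
        y: sum(genomic_counts[x] * misread_prob(x, y, pos_err) for x in kmers)
        for y in kmers
    }

# Hypothetical example: ACGT is sampled 100 times, ACGA is absent from the genome,
# and the last position has a higher error rate than the others.
genomic_counts = {"ACGT": 100, "ACGA": 0}
pos_err = [0.01, 0.01, 0.01, 0.05]
expected = expected_observed(genomic_counts, pos_err)
```

Under these assumptions the absent kmer `ACGA` still has a nonzero expected observed count, because it is a likely misread of a frequently sampled neighbor. Inferring genomic frequencies amounts to reasoning in the reverse direction, from observed counts back to the counts that best explain them.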
The software is implemented in C++ and is freely available under the GNU GPL3 license and the Boost Software V1.0 license at "http://aluru-sun.ece.iastate.edu/doku.php?id=redeem".
We introduce a statistical framework to model sequencing errors in next-generation reads, which yields promising results in detecting and correcting errors for genomes with high repeat content.