Greenfield Paul, Duesing Konsta, Papanicolaou Alexie, Bauer Denis C
CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia.
Bioinformatics. 2014 Oct;30(19):2723-32. doi: 10.1093/bioinformatics/btu368. Epub 2014 Jun 11.
Bioinformatics tools, such as assemblers and aligners, are expected to produce more accurate results when given better quality sequence data as their starting point. This expectation has led to the development of stand-alone tools whose sole purpose is to detect and remove sequencing errors. A good error-correcting tool would be a transparent component in a bioinformatics pipeline, simply taking sequence data in any of the standard formats and producing a higher quality version of the same data containing far fewer errors. It should not only be able to correct all of the types of errors found in real sequence data (substitutions, insertions, deletions and uncalled bases), but also be fast enough and scalable enough to be usable on the large datasets being produced by current sequencing technologies, and work on data derived from both haploid and diploid organisms.
This article presents Blue, an error-correction algorithm based on k-mer consensus and context. Blue can correct substitution, deletion and insertion errors, as well as uncalled bases. It accepts both FASTQ and FASTA formats, and corrects quality scores for corrected bases. Blue also maintains the pairing of reads, both within a file and between pairs of files, making it compatible with downstream tools that depend on read pairing. Blue is memory efficient, scalable and faster than other published tools, and usable on large sequencing datasets. On the tests undertaken, Blue also proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors. One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as a small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes.
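The idea of correcting reads against a separately derived k-mer consensus table can be illustrated with a minimal Python sketch. This is not Blue's implementation (Blue is written in C#, and its algorithm is not detailed in this abstract); it is a generic k-mer spectrum approach under stated assumptions. The function names `build_kmer_table` and `correct_read`, the `min_count` threshold and the toy sequences are all hypothetical. The sketch handles only substitutions and uncalled bases (`N`) that lie after the first k-1 bases of a read, and it ignores insertions, deletions, quality scores and read pairing.

```python
from collections import Counter


def build_kmer_table(reads, k):
    """Count every k-mer in the dataset used as the consensus source.

    Assumption: a plain in-memory Counter stands in for a real,
    memory-efficient k-mer table.
    """
    table = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1
    return table


def correct_read(read, table, k, min_count=2):
    """Repair substitutions and uncalled bases, scanning left to right.

    A k-mer seen fewer than min_count times (hypothetical threshold) is
    treated as erroneous. The sketch then tries each base in the k-mer's
    last position and keeps the best-supported alternative, if any.
    """
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if table[kmer] >= min_count:
            continue  # k-mer is well supported; nothing to fix
        prefix = kmer[:-1]
        best = max("ACGT", key=lambda b: table[prefix + b])
        if table[prefix + best] >= min_count:
            bases[i + k - 1] = best  # accept the supported replacement
    return "".join(bases)


# Cross-correction: the table comes from one dataset (toy "short reads"),
# and is applied to a read from another dataset (toy "long read").
short_reads = ["ACGTTGCAAGGCTA"] * 3
table = build_kmer_table(short_reads, k=5)

long_read_with_error = "ACGTTGCATGGCTA"    # A -> T substitution at index 8
long_read_with_n = "ACGTTGCAAGNCTA"        # uncalled base at index 10

print(correct_read(long_read_with_error, table, k=5))
print(correct_read(long_read_with_n, table, k=5))
```

Because `correct_read` takes the table as an argument, the dataset that supplies the consensus and the dataset being corrected are independent, which is the decoupling described above. A read with no supported alternative is returned unchanged.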
The code for Blue and its related tools is available from http://www.bioinformatics.csiro.au/Blue. These programs are written in C# and run natively under Windows and under Mono on Linux.