Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
BMC Bioinformatics. 2012 Nov 13;13:297. doi: 10.1186/1471-2105-13-297.
Despite significant advances in alignment algorithms, the exponential growth of nucleotide sequencing throughput threatens to outpace bioinformatic analysis. Computation may become the bottleneck of genome analysis if growing alignment costs are not mitigated by further algorithmic improvements. Indexing and compressing alignment databases has yielded substantial gains, but many widely used alignment tools process input reads sequentially and are oblivious to any underlying redundancy in the reads themselves.
Here we present Oculus, a software package that attaches to standard aligners and exploits read redundancy by performing streaming compression, alignment, and decompression of input sequences. This nearly lossless process (> 99.9%) led to alignment speedups of up to 270% across a variety of data sets, while requiring a modest amount of memory. We expect that streaming read compressors such as Oculus could become a standard addition to existing RNA-Seq and ChIP-Seq alignment pipelines, and potentially other applications in the future as throughput increases.
Oculus efficiently condenses redundant input reads and wraps existing aligners to provide nearly identical SAM output in a fraction of the aligner runtime. It includes a number of useful features, such as tunable performance and fidelity options, compatibility with FASTA and FASTQ files, and adherence to the SAM format. The platform-independent C++ source code is freely available online at http://code.google.com/p/oculus-bio.
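The compress, align, decompress idea described above can be illustrated with a minimal Python sketch. This is not the Oculus implementation (which is written in C++) and it makes simplifying assumptions: reads are held in memory rather than streamed, only exact duplicate sequences are collapsed, and the names `compress_reads`, `align_compressed`, and `toy_aligner` are hypothetical helpers introduced here for illustration. The toy aligner is just an exact substring search standing in for a real aligner.

```python
from collections import OrderedDict
from typing import Callable, Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read name, nucleotide sequence)


def compress_reads(reads: Iterable[Read]) -> "OrderedDict[str, List[str]]":
    """Group read names by sequence so each distinct sequence appears once."""
    groups: "OrderedDict[str, List[str]]" = OrderedDict()
    for name, seq in reads:
        groups.setdefault(seq, []).append(name)
    return groups


def align_compressed(
    reads: Iterable[Read],
    align_unique: Callable[[List[str]], Dict[str, int]],
) -> List[Tuple[str, str, int]]:
    """Align only the distinct sequences, then expand hits back to every read.

    `align_unique` is a stand-in for the wrapped aligner (hypothetical
    interface): it takes a list of sequences and returns a dict mapping
    each sequence to an alignment position.
    """
    groups = compress_reads(reads)            # compression
    hits = align_unique(list(groups))         # alignment of unique sequences
    expanded = []
    for seq, names in groups.items():         # decompression
        for name in names:
            expanded.append((name, seq, hits[seq]))
    return expanded


def toy_aligner(reference: str) -> Callable[[List[str]], Dict[str, int]]:
    """Build a toy exact-match 'aligner' (assumption: illustration only)."""
    def align(seqs: List[str]) -> Dict[str, int]:
        # str.find returns -1 when the sequence is absent from the reference.
        return {seq: reference.find(seq) for seq in seqs}
    return align
```

In this sketch the aligner is called on two sequences for three input reads when two of them are identical, which is the source of the saving. Note that the expanded records come out grouped by sequence rather than in input order; a real wrapper would need to decide whether to restore the original ordering.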