Egidi Lavinia, Louza Felipe A, Manzini Giovanni, Telles Guilherme P
1DiSIT, University of Eastern Piedmont, Viale Michel, 11, 15121 Alessandria, Italy.
2Department of Computing and Mathematics, University of São Paulo, Av. Bandeirantes, 3900, 14040-901 Ribeirão Preto, Brazil.
Algorithms Mol Biol. 2019 Mar 8;14:6. doi: 10.1186/s13015-019-0140-0. eCollection 2019.
Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.
We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.
We prove that our algorithm performs sequential I/Os, where is the total length of the collection and is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
测序技术产生的生物序列集合越来越大,必须存储在支持快速搜索操作的压缩索引中。许多压缩索引基于Burrows-Wheeler变换(BWT)和最长公共前缀(LCP)数组。由于输入数据量巨大,以最佳方式利用可用随机存取存储器(RAM)在外部存储器中高效地构建这些数据结构非常重要。
我们提出了一种空间高效的算法,用于在外部或半外部存储器设置下计算序列集合的BWT和LCP数组。我们的算法将输入集合拆分为足够小的子集合,以便能够使用最优线性时间算法在RAM中计算它们的BWT。接下来,它在外部或半外部存储器中合并部分BWT,并在此过程中计算LCP值。我们的算法可以修改为输出另外两个数组,与BWT和LCP数组结合,为生物信息学中的三个著名问题提供简单的、基于扫描的外部存储器算法:最大重复序列的计算、所有后缀-前缀重叠对的计算以及简洁德布鲁因图的构建。
我们证明我们的算法执行 次顺序输入/输出操作,其中 是集合的总长度, 是最大LCP值。实验结果表明,对于短序列,我们的算法仅比现有技术稍慢,但对于长序列或当可用RAM至少等于输入大小时,它快达40倍。