Rachid Maan Haj, Malluhi Qutaibah, Abouelhoda Mohamed
KINDI Lab for Computing Research, Qatar University P.O. Box 2713, Doha, Qatar.
Faculty of Engineering, Cairo University, Giza, Egypt ; Center for Informatics Sciences, Nile University, Giza, Egypt.
Biomed Res Int. 2014;2014:745298. doi: 10.1155/2014/745298. Epub 2014 Apr 16.
The all-pairs suffix-prefix matching problem is a basic problem in string processing. It has an application in the de novo genome assembly task, which is one of the major bioinformatics problems. Due to the large size of the input data, it is crucial to use fast and space efficient solutions. In this paper, we present a space-economical solution to this problem using the generalized Sadakane compressed suffix tree. Furthermore, we present a parallel algorithm to provide more speed for shared memory computers. Our sequential and parallel algorithms are optimized by exploiting features of the Sadakane compressed index data structure. Experimental results show that our solution based on the Sadakane's compressed index consumes significantly less space than the ones based on noncompressed data structures like the suffix tree and the enhanced suffix array. Our experimental results show that our parallel algorithm is efficient and scales well with increasing number of processors.
全对后缀-前缀匹配问题是字符串处理中的一个基本问题。它在从头基因组组装任务中有应用,而从头基因组组装任务是主要的生物信息学问题之一。由于输入数据规模巨大,使用快速且节省空间的解决方案至关重要。在本文中,我们使用广义的笹兼压缩后缀树提出了一种针对此问题的节省空间的解决方案。此外,我们提出了一种并行算法,为共享内存计算机提供更高的速度。我们的顺序和并行算法通过利用笹兼压缩索引数据结构的特性进行了优化。实验结果表明,我们基于笹兼压缩索引的解决方案比基于后缀树和增强后缀数组等非压缩数据结构的解决方案消耗的空间要少得多。我们的实验结果表明,我们的并行算法效率高,并且随着处理器数量的增加扩展性良好。