School of Information, Yunnan University, KunMing, China.
Yunnan Physical Science and Sports Professional College, KunMing, China.
BMC Bioinformatics. 2023 Sep 30;24(1):369. doi: 10.1186/s12859-023-05500-z.
A large number of researchers have devoted to accelerating the speed of genome sequencing and reducing the cost of genome sequencing for decades, and they have made great strides in both areas, making it easier for researchers to study and analyze genome data. However, how to efficiently store and transmit the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers. Therefore, the research of genome data compression algorithms to facilitate the efficient representation of genome data has gradually attracted the attention of these researchers. Meanwhile, considering that the current computing devices have multiple cores, how to make full use of the advantages of the computing devices and improve the efficiency of parallel processing is also an important direction for designing genome compression algorithms.
We proposed an algorithm (LMSRGC) based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMS) for the compression of genome data in FASTA format. The proposed algorithm utilizes the characteristics of SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence and then utilizes LMSs to compress the target genome sequence. To speed up the operation of the algorithm, we use GPUs to parallelize the construction of SA, while using multiple threads to parallelize the creation of the LCP array and the filtering of LMSs.
Experiment results demonstrate that our algorithm is competitive with the current state-of-the-art algorithms in compression ratio and compression time.
数十年来,大量研究人员致力于提高基因组测序速度和降低基因组测序成本,在这两个领域都取得了重大进展,使研究人员更容易研究和分析基因组数据。然而,如何有效地存储和传输高通量测序技术生成的大量基因组数据,已成为数据压缩研究人员面临的挑战。因此,研究基因组数据压缩算法以促进基因组数据的高效表示逐渐引起了这些研究人员的关注。同时,考虑到当前的计算设备具有多个内核,如何充分利用计算设备的优势并提高并行处理的效率,也是设计基因组压缩算法的一个重要方向。
我们提出了一种基于参考基因组序列的算法(LMSRGC),用于压缩 FASTA 格式的基因组数据。该算法使用后缀数组(SA)和最长公共前缀(LCP)数组来查找最长匹配子字符串(LMS),以实现对基因组数据的压缩。该算法利用 SA 和 LCP 数组的特点,选择压缩基因组序列和参考基因组序列之间的所有合适的 LMS,然后利用 LMS 来压缩目标基因组序列。为了加速算法的运行,我们使用 GPU 并行化 SA 的构建,同时使用多个线程并行化 LCP 数组的创建和 LMS 的过滤。
实验结果表明,我们的算法在压缩比和压缩时间方面与当前最先进的算法具有竞争力。