Ferro Eddie, Oliva Marco, Gagie Travis, Boucher Christina
Department of Computer and Information Science and Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, FL 32607, USA.
Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada.
iScience. 2024 Sep 12;27(10):110933. doi: 10.1016/j.isci.2024.110933. eCollection 2024 Oct 18.
Pangenome alignment offers a way to reduce bias in biomedical research. Traditionally, short-read aligners such as Bowtie and BWA index a single reference genome to find approximate alignments. Because their memory requirements grow linearly with the input, these methods can index only a few genomes. Emerging pangenome aligners such as VG, Giraffe, and Moni address this limitation by indexing more genomes. VG and Giraffe use a variation graph, whereas Moni indexes the sequences in a way that accounts for repetition, using prefix-free parsing to build a dictionary and a parse. The main challenge is the size of the parse, which becomes significantly larger than the dictionary. To scale Moni, we propose removing the parse from the construction of the run-length encoded BWT (RLBWT), suffix array, and longest common prefix (LCP) array by applying prefix-free parsing recursively. This approach improves construction time and memory requirements, enabling efficient construction of the RLBWT, suffix array, and LCP array for large pangenomes, such as those from the Human Pangenome Reference Consortium.
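To make the terms "dictionary" and "parse" concrete, the following is a minimal, single-level sketch of prefix-free parsing in Python. It is an illustration under stated assumptions, not Moni's implementation: the function names, the window length `w`, the modulus `p`, the toy polynomial hash, and the `$` sentinel (assumed not to occur in the input) are all hypothetical choices made for this example. A window of `w` characters slides over the text; whenever its hash is 0 modulo `p`, the window is treated as a trigger string and a phrase ends there. Consecutive phrases overlap by `w` characters, the dictionary is the sorted set of distinct phrases, and the parse is the sequence of phrase ranks.

```python
def window_hash(window):
    """Toy polynomial hash of a window (an assumption; any deterministic hash works)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text, w=4, p=5):
    """Split text into overlapping phrases; return (dictionary, parse).

    Assumes the sentinel character '$' does not occur in text.
    """
    padded = text + "$" * w  # the final window of sentinels always ends a phrase
    phrases = []
    start = 0
    for i in range(1, len(padded) - w + 1):
        window = padded[i:i + w]
        is_last = (i + w == len(padded))
        if is_last or window_hash(window) % p == 0:
            # The phrase runs from the previous trigger through this trigger.
            phrases.append(padded[start:i + w])
            start = i  # the next phrase begins with this trigger string
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w=4):
    """Invert prefix_free_parse: drop the w-character overlap between phrases."""
    phrases = [dictionary[r] for r in parse]
    if not phrases:
        return ""
    padded = phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
    return padded[:-w]  # strip the sentinel padding


if __name__ == "__main__":
    sample = "GATTACAT" * 3 + "GATTAGATA"
    dictionary, parse = prefix_free_parse(sample)
    print(len(dictionary), "distinct phrases;", len(parse), "parse entries")
    assert reconstruct(dictionary, parse) == sample
```

On highly repetitive input the dictionary stays small because repeated regions produce the same phrases, while the parse keeps growing with the total input length. That growth is the bottleneck the abstract describes; the recursive step it proposes (parsing the parse again) is not shown in this sketch.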