Li Heng
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.
Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae717.
The Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices.
We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 indexed 320 assembled human genomes in 65 h and indexed 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using up to 170 gigabytes of memory at peak and no working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve similar local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale.
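For readers unfamiliar with the BWT, the following minimal Python sketch may help illustrate why it suits redundant sequences and how exact-match counting works on it. It is a toy illustration only, not ropebwt3's construction or query algorithm: it builds the BWT by sorting all rotations (quadratic memory, fine only for short strings) and counts pattern occurrences with a plain backward search. The function names `bwt`, `count_runs`, and `count_occurrences` and the `$` sentinel are assumptions made for this example.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel, take the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of identical characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    counts = Counter(bwt_str)
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def rank(c: str, i: int) -> int:
        # occurrences of c in bwt_str[:i]; linear scan, acceptable for a toy
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


# Toy "pangenome": several identical copies of one short sequence.
unit = "GATTACAGGATCCA"
single = bwt(unit)
repeated = bwt(unit * 8)
print(len(single), count_runs(single))
print(len(repeated), count_runs(repeated))
```

Comparing the two printed lines shows how the number of runs in the BWT changes relative to the text length when the input is highly repetitive, which is the property run-length-encoded indices rely on. For example, `count_occurrences(bwt("banana"), "ana")` counts both overlapping occurrences of `ana`.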