IEEE/ACM Trans Comput Biol Bioinform. 2023 May-Jun;20(3):2210-2222. doi: 10.1109/TCBB.2023.3240196. Epub 2023 Jun 5.
Range-join is an operation for finding overlaps in interval-form genomic data. Range-join is widely used in various genome analysis processes such as annotation, filtering and comparison of variants in whole-genome and exome analysis pipelines. The quadratic complexity of current algorithms with sheer data volume has surged the design challenges. Existing tools have limitations on algorithm efficiency, parallelism, scalability and memory consumption. This paper proposes BIndex, a novel bin-based indexing algorithm and its distributed implementation to attain high throughput range-join processing. BIndex features near-constant search complexity while the inherently parallel data structure facilitates exploitation of parallel computing architectures. Balanced partitioning of dataset further enables scalability on distributed frameworks. The implementation on Message Passing Interface shows upto 933.5x speedup in comparison to state-of-the-art tools. Parallel nature of BIndex further enables GPU-based acceleration with 3.72x speedup than CPU implementations. The add-in modules for Apache Spark provides upto 4.65x speedup than the previously best available tool. BIndex supports wide variety of input and output formats prevalent in bioinformatics community and the algorithm is easily extendable to streaming data in recent Big Data solutions. Furthermore, the index data structure is memory-efficient and consumes upto two orders-of-magnitude lesser RAM, while having no adverse effect on speedup.
范围连接是一种用于在区间形式的基因组数据中查找重叠的操作。范围连接在各种基因组分析过程中广泛使用,例如在全基因组和外显子组分析管道中的注释、变体过滤和比较。随着数据量的增加,当前算法的二次复杂度给设计带来了挑战。现有的工具在算法效率、并行性、可扩展性和内存消耗方面存在局限性。本文提出了 BIndex,这是一种新颖的基于桶的索引算法及其分布式实现,以实现高通量范围连接处理。BIndex 的特点是搜索复杂度接近常数,而固有的并行数据结构有利于利用并行计算架构。数据集的平衡分区进一步实现了分布式框架的可扩展性。在消息传递接口上的实现与最先进的工具相比,速度提高了 933.5 倍。BIndex 的并行性还使 GPU 加速提高了 3.72 倍,而 CPU 实现的速度提高了 3.72 倍。在 Apache Spark 中的附加模块比以前最好的可用工具提供了高达 4.65 倍的速度提升。BIndex 支持生物信息学社区中流行的各种输入和输出格式,并且算法易于扩展到最近的大数据解决方案中的流数据。此外,索引数据结构具有内存效率,消耗的 RAM 少两个数量级,而对加速没有不利影响。