Gulhan A Burak, Burhans Richard, Harris Robert, Kandemir Mahmut, Haeussler Maximilian, Nekrutenko Anton
Department of Computer Science and Engineering, Penn State University.
Department of Biochemistry and Molecular Biology, Penn State University.
bioRxiv. 2024 Sep 3:2024.09.02.610839. doi: 10.1101/2024.09.02.610839.
Our ability to generate sequencing data and assemble it into high-quality, complete genomes has advanced rapidly in recent years. These data promise to advance our understanding of organismal biology and answer longstanding evolutionary questions. Multiple genome alignment is a key tool in this quest. It is also the area that is lagging: today we can generate genomes faster than we can construct and update the multiple alignments containing them. The bottleneck is the considerable computational time required to generate accurate pairwise alignments between divergent genomes, an unavoidable precursor to multiple alignments. This step is typically performed with lastZ, a tool that is very sensitive but correspondingly slow. Here we describe KegAlign, an optimized GPU-enabled pairwise aligner. It combines a new parallelization strategy, diagonal partitioning, with the latest features of modern GPUs. With KegAlign, a typical human/mouse alignment can be computed in under 6 hours on a machine with a single NVIDIA A100 GPU and 80 CPU cores, without any pre-partitioning of the input sequences: a ~150× improvement over lastZ. While other pairwise aligners can complete this task in a fraction of that time, none achieves the sensitivity of lastZ, KegAlign's main alignment engine, and thus they may not be suitable for comparing divergent genomes. In addition to the source code and a Conda package for KegAlign, we provide a Galaxy workflow that can be readily used by anyone.
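To put the quoted figures in perspective, the short sketch below works out the lastZ runtime implied by "under 6 hours" and "~150×". It is only back-of-the-envelope arithmetic: the function name `implied_baseline_hours` is hypothetical, both inputs are the rounded values from the abstract, and the abstract does not say on what hardware or core count the lastZ baseline was measured, so the result should be read as an order-of-magnitude estimate.

```python
def implied_baseline_hours(accelerated_hours: float, speedup: float) -> float:
    """Return the baseline runtime implied by an accelerated runtime and a speedup factor.

    Hypothetical helper for illustration; not part of KegAlign or lastZ.
    """
    if accelerated_hours <= 0 or speedup <= 0:
        raise ValueError("runtime and speedup must be positive")
    return accelerated_hours * speedup


# Rounded values taken from the abstract; both are approximate.
kegalign_hours = 6.0   # upper bound: "under 6 hours"
speedup = 150.0        # "~150x improvement over lastZ"

baseline_hours = implied_baseline_hours(kegalign_hours, speedup)
baseline_days = baseline_hours / 24.0

print(f"Implied lastZ runtime: about {baseline_hours:.0f} hours ({baseline_days:.1f} days)")
# prints "Implied lastZ runtime: about 900 hours (37.5 days)"
```

Under these assumptions, a single human/mouse comparison with lastZ would take on the order of five weeks, which illustrates why pairwise alignment is described as the bottleneck for building and updating multiple alignments.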