Singh Noor Pratap, Khan Jamshed, Patro Rob
bioRxiv. 2025 Mar 12:2024.11.27.625771. doi: 10.1101/2024.11.27.625771.
Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses, often with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a modified pseudoalignment scheme that partitions each reference into "virtual colors". These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct "colors" from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac. We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC. Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately one third of the memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual color-enhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry) to work toward providing a truly open alternative to many of the varied capabilities of Cell Ranger. Furthermore, our modified pseudoalignment approach should be easily applicable and extendable to other genome-centric mapping-based tasks and modalities such as standard DNA-seq, DNase-seq, ChIP-seq and Hi-C.
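The "virtual colors" described in the abstract, overlapping bins of fixed maximal extent on each reference, can be illustrated with a minimal sketch. This is not the alevin-fry-atac implementation: the names `VirtualColor`, `virtual_colors` and `colors_covering`, and the bin size and overlap values used in the example, are hypothetical and chosen only for illustration. The abstract does not state the actual parameters or data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualColor:
    """One bin on a reference, treated as its own 'color' (hypothetical type)."""
    ref_id: int
    bin_index: int
    start: int  # inclusive
    end: int    # exclusive


def virtual_colors(ref_id, ref_len, bin_size, overlap):
    """Partition [0, ref_len) into overlapping bins of at most bin_size bases.

    Consecutive bins share `overlap` bases; the last bin is truncated at the
    end of the reference. Both parameters are illustrative assumptions.
    """
    if ref_len <= 0:
        raise ValueError("ref_len must be positive")
    if not 0 <= overlap < bin_size:
        raise ValueError("overlap must satisfy 0 <= overlap < bin_size")
    step = bin_size - overlap
    colors = []
    start = 0
    while True:
        end = min(start + bin_size, ref_len)
        colors.append(VirtualColor(ref_id, len(colors), start, end))
        if end == ref_len:
            break
        start += step
    return colors


def colors_covering(colors, pos):
    """Return the bin indices whose interval contains reference position pos."""
    return [c.bin_index for c in colors if c.start <= pos < c.end]


# Example: a 2,500-base reference, 1,000-base bins, 200-base overlap.
example = virtual_colors(ref_id=0, ref_len=2500, bin_size=1000, overlap=200)
# Bins: [0, 1000), [800, 1800), [1600, 2500)
# Position 900 lies in the overlap of bins 0 and 1; position 1200 only in bin 1.
```

In this sketch, a position inside an overlap region belongs to two bins, so a sequence spanning a bin boundary can still fall entirely within one bin, provided it is no longer than the overlap. Whether the tool uses the overlap for this purpose is an assumption of the sketch and is not stated in the abstract.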