Semwal Ayush, Morrison Jacob, Beddows Ian, Palmer Theron, Majewski Mary F, Jang H Josh, Johnson Benjamin K, Shen Hui
bioRxiv. 2025 Jul 31:2025.07.25.666829. doi: 10.1101/2025.07.25.666829.
Long-read single-cell RNA sequencing using platforms such as Oxford Nanopore Technologies (ONT) enables full-length transcriptome profiling at single-cell resolution. However, high sequencing error rates, diverse library architectures, and increasing dataset scale make it difficult to accurately identify cell barcodes (CBCs) and unique molecular identifiers (UMIs), which are key prerequisites for reliable demultiplexing and deduplication, respectively. Existing pipelines rely on hard-coded heuristics or local transition rules that cannot fully capture the broader structural context of a read, and they often fail to robustly interpret reads with indel-induced shifts, truncated segments, or non-canonical element ordering. We introduce Tranquillyzer (TRANscript QUantification In Long reads-anaLYZER), a flexible, architecture-aware deep learning framework for processing long-read single-cell RNA-seq data. Tranquillyzer employs a hybrid neural network architecture and a global, context-aware design, enabling precise identification of structural elements even when they are shifted, partially degraded, or repeated due to sequencing noise or library construction variability. In addition to supporting established single-cell protocols, Tranquillyzer accommodates custom library formats through rapid, one-time model training on user-defined label schemas, typically completed within a few hours on standard GPUs. Additional features, such as scalability across large datasets and comprehensive visualization capabilities, further position Tranquillyzer as a flexible and scalable framework for processing long-read single-cell transcriptomic datasets.
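To make the failure mode concrete, the following is a minimal sketch of why fixed-offset extraction of a CBC and UMI breaks under an indel-induced shift, and how a simple anchor-based heuristic recovers from it. It is not the framework's method: the abstract describes a learned, context-aware model, which is not reproduced here. The read layout (adapter, 16-nt CBC, 12-nt UMI), the adapter sequence, and all function names are assumptions chosen for illustration only.

```python
# Hypothetical read layout (assumption): ADAPTER + CBC + UMI + poly(T) + cDNA
ADAPTER = "CTACACGACGCTCTTCCGATCT"  # illustrative adapter sequence
CBC_LEN, UMI_LEN = 16, 12           # assumed element lengths


def extract_fixed(read):
    """Slice CBC and UMI at fixed offsets, assuming an error-free adapter."""
    start = len(ADAPTER)
    cbc = read[start:start + CBC_LEN]
    umi = read[start + CBC_LEN:start + CBC_LEN + UMI_LEN]
    return cbc, umi


def find_adapter_end(read, adapter=ADAPTER, max_edits=2):
    """Return the read position where the adapter ends, allowing edits.

    Uses approximate matching by dynamic programming: the adapter may
    start anywhere in the read, and the earliest end position with the
    lowest edit distance is reported. Returns None if no match is within
    max_edits.
    """
    m = len(adapter)
    prev = list(range(m + 1))      # distances against the empty read prefix
    best_dist, best_end = prev[m], 0
    for j, base in enumerate(read, start=1):
        cur = [0]                  # a match may start at any read position
        for i in range(1, m + 1):
            cost = 0 if adapter[i - 1] == base else 1
            cur.append(min(prev[i - 1] + cost,  # match or substitution
                           prev[i] + 1,         # extra base in the read
                           cur[i - 1] + 1))     # adapter base missing
        if cur[m] < best_dist:
            best_dist, best_end = cur[m], j
        prev = cur
    return best_end if best_dist <= max_edits else None


def extract_anchored(read, max_edits=2):
    """Slice CBC and UMI relative to the approximately located adapter."""
    end = find_adapter_end(read, ADAPTER, max_edits)
    if end is None:
        return None
    cbc = read[end:end + CBC_LEN]
    umi = read[end + CBC_LEN:end + CBC_LEN + UMI_LEN]
    return cbc, umi


if __name__ == "__main__":
    cbc, umi = "AAACCCAAGAAACACT", "GGTTCCAATTGG"
    tail = "T" * 10 + "ACGTACGT"
    clean = ADAPTER + cbc + umi + tail
    # Simulate a single-base deletion inside the adapter.
    noisy = ADAPTER[:5] + ADAPTER[6:] + cbc + umi + tail
    print("fixed, clean:   ", extract_fixed(clean))
    print("fixed, noisy:   ", extract_fixed(noisy))
    print("anchored, noisy:", extract_anchored(noisy))
```

In this toy case, the single deleted base shifts every downstream element by one position, so the fixed-offset slice returns a wrong barcode while the anchored slice does not. Anchoring is still a local rule: it does not handle truncated segments, repeated elements, or non-canonical ordering, which is the gap the abstract attributes to heuristic pipelines.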