Fu Yuntian, Kim Heonseok, Adams Jenea I, Grimes Susan M, Huang Sijia, Lau Billy T, Sathe Anuja, Hess Paul, Ji Hanlee P, Zhang Nancy R
Graduate Program in Genomics and Computational Biology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA.
Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA.
Res Sq. 2023 Mar 21:rs.3.rs-2674892. doi: 10.21203/rs.3.rs-2674892/v1.
Long-read sequencing has become a powerful tool for alternative splicing analysis. However, technical and computational challenges have limited our ability to explore alternative splicing at single-cell and spatial resolution. The higher sequencing error rate of long reads, especially the high indel rate, has limited the accuracy of cell barcode and unique molecular identifier (UMI) recovery. Read truncation and mapping errors, the latter exacerbated by the higher sequencing error rate, can cause the false detection of spurious new isoforms. Downstream, there is as yet no rigorous statistical framework to quantify splicing variation within and between cells/spots. In light of these challenges, we developed Longcell, a statistical framework and computational pipeline for accurate isoform quantification in single-cell and spatial spot-barcoded long-read sequencing data. Longcell performs computationally efficient cell/spot barcode extraction, UMI recovery, and UMI-based truncation- and mapping-error correction. Through a statistical model that accounts for varying read coverage across cells/spots, Longcell rigorously quantifies the level of inter-cell/spot versus intra-cell/spot diversity in exon usage and detects changes in splicing distributions between cell populations. Applying Longcell to single-cell long-read data from multiple contexts, we found that intra-cell splicing heterogeneity, where multiple isoforms co-exist within the same cell, is ubiquitous for highly expressed genes. On matched single-cell and Visium long-read sequencing of tissue from a colorectal cancer metastasis to the liver, Longcell found concordant signals between the two data modalities. Finally, in a perturbation experiment for 9 splicing factors, Longcell identified regulatory targets that were validated by targeted sequencing.
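To illustrate why high indel rates complicate cell barcode recovery, the following minimal sketch assigns an observed barcode to a known whitelist using Levenshtein edit distance, which tolerates insertions and deletions as well as substitutions. It is a generic illustration, not Longcell's actual algorithm: the function names `edit_distance` and `assign_barcode`, the `max_dist` threshold, and the toy whitelist are all assumptions made for this example.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def assign_barcode(observed: str,
                   whitelist: Sequence[str],
                   max_dist: int = 2) -> Optional[str]:
    """Return the unique closest whitelist barcode within max_dist, else None.

    max_dist is a hypothetical tolerance chosen for illustration only.
    """
    scored = sorted((edit_distance(observed, bc), bc) for bc in whitelist)
    if not scored or scored[0][0] > max_dist:
        return None  # nothing close enough
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # two equally close candidates: ambiguous, discard
    return scored[0][1]


# Toy whitelist (hypothetical barcodes).
WHITELIST = ["ACGTACGT", "TTTTGGGG", "CCCCAAAA"]
# The observed read has lost one base relative to the first barcode.
recovered = assign_barcode("ACGTCGT", WHITELIST)
```

A Hamming-distance comparison would fail on the example read because the deletion shifts every downstream base; edit distance handles that case at the cost of a quadratic-time comparison per candidate, which is why practical pipelines usually pre-filter candidates before aligning.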