Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, United States.
Program in Computational Biology, Bioinformatics and Genomics, University of Maryland, College Park, MD 20742, United States.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i297-i306. doi: 10.1093/bioinformatics/btae207.
Short-read single-cell RNA-sequencing (scRNA-seq) has been used to study cellular heterogeneity, cellular fate, and transcriptional dynamics. Modeling splicing dynamics in scRNA-seq data is challenging, with inherent difficulty in even the seemingly straightforward task of elucidating the splicing status of the molecules from which sequenced fragments are drawn. This difficulty arises, in part, from the limited read length and positional biases, which substantially reduce the specificity of the sequenced fragments. As a result, the splicing status of many reads in scRNA-seq is ambiguous because of a lack of definitive evidence. There is therefore a need for methods that can recover the splicing status of ambiguous reads, which, in turn, can improve the accuracy and confidence of downstream analyses.
We develop Forseti, a predictive model to probabilistically assign a splicing status to scRNA-seq reads. Our model has two key components. First, we train a binding affinity model to assign a probability that a given transcriptomic site is used in fragment generation. Second, we fit a robust fragment length distribution model that generalizes well across datasets deriving from different species and tissue types. Forseti combines these two trained models to predict the splicing status of the molecule of origin of reads by scoring putative fragments that associate each alignment of sequenced reads with proximate potential priming sites. Using both simulated and experimental data, we show that our model can precisely predict the splicing status of many reads and identify the true gene origin of multi-gene mapped reads.
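The scoring idea described above can be illustrated with a minimal sketch. This is not Forseti's implementation: the Gaussian fragment length density, the function names, the choice to keep only the best-scoring priming site, and all numbers are assumptions made for illustration. It only shows how a binding probability and a fragment length probability might be combined per putative fragment and then normalized across splicing statuses.

```python
import math


def fragment_length_prob(length, mean=300.0, sd=100.0):
    """Toy stand-in for a fitted fragment length model (assumed Gaussian density)."""
    if length <= 0:
        return 0.0
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def best_fragment_score(read_start, priming_sites):
    """Score each putative fragment and keep the best one (illustrative choice).

    priming_sites is a list of (position, binding_probability) pairs, where
    binding_probability would come from a trained binding affinity model.
    """
    best = 0.0
    for site_pos, bind_prob in priming_sites:
        frag_len = site_pos - read_start  # putative fragment: read start to priming site
        best = max(best, bind_prob * fragment_length_prob(frag_len))
    return best


def splicing_probabilities(alignments):
    """Normalize per-status scores into probabilities.

    alignments maps a splicing status (e.g. "spliced", "unspliced") to a
    (read_start, priming_sites) pair on the corresponding reference.
    """
    scores = {
        status: best_fragment_score(read_start, sites)
        for status, (read_start, sites) in alignments.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        # No informative evidence: fall back to a uniform assignment.
        return {status: 1.0 / len(scores) for status in scores}
    return {status: score / total for status, score in scores.items()}


# Hypothetical read: on the spliced reference a priming site lies 300 bases
# downstream; on the unspliced reference the nearest site is 2000 bases away.
example = {
    "spliced": (1000, [(1300, 0.9)]),
    "unspliced": (5000, [(7000, 0.9)]),
}
probs = splicing_probabilities(example)
```

In this toy case nearly all of the probability mass goes to the spliced status, because the unspliced alignment implies an implausibly long fragment under the assumed length distribution.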
Forseti and the code used for producing the results are available at https://github.com/COMBINE-lab/forseti under a BSD 3-clause license.