He Dongze, Gao Yuan, Chan Spencer Skylar, Quintana-Parrilla Natalia, Patro Rob
Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA.
Program in Computational Biology, Bioinformatics and Genomics, University of Maryland, College Park, MD 20742, USA.
bioRxiv. 2024 Feb 5:2024.02.01.577813. doi: 10.1101/2024.02.01.577813.
Short-read single-cell RNA-sequencing (scRNA-seq) has been used to study cellular heterogeneity, cellular fate, and transcriptional dynamics. Modeling splicing dynamics in scRNA-seq data is challenging, with inherent difficulty in even the seemingly straightforward task of elucidating the splicing status of the molecules from which sequenced fragments are drawn. This difficulty arises, in part, from the limited read length and positional biases, which substantially reduce the specificity of the sequenced fragments. As a result, the splicing status of many reads in scRNA-seq is ambiguous because of a lack of definitive evidence. Methods are therefore needed that can recover the splicing status of ambiguous reads, which in turn can lead to greater accuracy and confidence in downstream analyses.
We develop Forseti, a predictive model to probabilistically assign a splicing status to scRNA-seq reads. Our model has two key components. First, we train a binding affinity model to assign a probability that a given transcriptomic site is used in fragment generation. Second, we fit a robust fragment length distribution model that generalizes well across datasets deriving from different species and tissue types. Forseti combines these two trained models to predict the splicing status of the molecule of origin of reads by scoring putative fragments that associate each alignment of sequenced reads with proximate potential priming sites. Using both simulated and experimental data, we show that our model can precisely predict the splicing status of reads and identify the true gene origin of multi-gene mapped reads.
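The scoring idea described above can be illustrated with a minimal sketch. This is not Forseti's implementation: the Gaussian fragment length density, the use of the best-scoring site per reference, and all names (`PrimingSite`, `fragment_length_prob`, `best_fragment_score`, `splicing_posterior`) and numbers are assumptions made purely for illustration. The sketch only shows how a priming-site probability and a fragment length probability might be combined to compare a spliced and an unspliced origin for one read.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrimingSite:
    """Hypothetical candidate priming site near a read alignment."""
    position: int        # coordinate on the reference (assumed convention)
    binding_prob: float  # stand-in for the binding affinity model's output


def fragment_length_prob(length: int, mean: float = 300.0, sd: float = 100.0) -> float:
    """Toy stand-in for a fitted fragment length distribution (Gaussian density).

    The mean and sd are made-up values, not parameters from the paper.
    """
    if length <= 0:
        return 0.0
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def best_fragment_score(read_start: int, sites: List[PrimingSite]) -> float:
    """Score each putative fragment (read start to priming site); keep the best.

    Taking the maximum is an assumption; summing over sites is another option.
    """
    return max(
        (s.binding_prob * fragment_length_prob(s.position - read_start) for s in sites),
        default=0.0,
    )


def splicing_posterior(spliced_score: float, unspliced_score: float) -> Optional[float]:
    """Normalize the two scores into P(spliced); None if there is no evidence."""
    total = spliced_score + unspliced_score
    if total == 0.0:
        return None
    return spliced_score / total


# Made-up example: on the spliced transcript the read sits 300 bases upstream
# of the poly(A) tail; on the unspliced transcript the nearest candidate site
# is 1500 bases away, which is implausible under the toy length model.
spliced_sites = [PrimingSite(position=1300, binding_prob=0.9)]
unspliced_sites = [PrimingSite(position=6500, binding_prob=0.6)]

p_spliced = splicing_posterior(
    best_fragment_score(1000, spliced_sites),
    best_fragment_score(5000, unspliced_sites),
)
print(f"P(spliced) = {p_spliced}")
```

In this toy case the read is assigned to the spliced molecule with high probability, because only the spliced alignment yields a fragment of plausible length. A read whose alignments give comparable scores under both references would remain close to 0.5, reflecting genuine ambiguity.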
Forseti and the code used for producing the results are available at https://github.com/COMBINE-lab/forseti under a BSD 3-clause license.