Ma Cong, Zheng Hongyu, Kingsford Carl
Computer Science Department, School of Engineering and Applied Science, Princeton University, 35 Olden Street, Princeton, NJ, 08544, USA.
Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA, 15213, USA.
Algorithms Mol Biol. 2021 May 10;16(1):5. doi: 10.1186/s13015-021-00184-7.
The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification; it was first proposed by Bernard et al. (Bioinformatics 30:2447-55, 2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript. However, the previous graph quantification model assumes that the length of single-end reads or paired-end fragments is fixed.
We extend this model to handle variable-length reads or fragments and to incorporate bias correction. We prove that our model is equivalent to running a transcript quantifier with exactly the set of all compatible transcripts. The key to our method is constructing an extension of the splice graph based on Aho-Corasick automata. The proof of equivalence is based on a novel reparameterization of the read generation model of a state-of-the-art transcript quantification method.
We propose a new approach for graph quantification. It is useful for modeling scenarios where the reference transcriptome is incomplete or unavailable, and it can be further used in transcriptome assembly or alternative splicing analysis.
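As an illustration of the statement that every full path in the splice graph is a possible transcript, the following minimal Python sketch enumerates all source-to-sink paths in a toy splice graph. The graph, the node names (`S`, `T`, `exon1`, and so on), and the function `enumerate_full_paths` are hypothetical and chosen only for this example; this is not the authors' method, which works with junction abundances precisely so that paths need not be listed explicitly.

```python
from typing import Dict, List


def enumerate_full_paths(
    graph: Dict[str, List[str]], source: str, sink: str
) -> List[List[str]]:
    """Return every source-to-sink path in a DAG given as an adjacency list."""
    paths: List[List[str]] = []
    # Each stack entry holds the current node and the path taken to reach it.
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for successor in graph.get(node, []):
            stack.append((successor, path + [successor]))
    # Sort so the result does not depend on traversal order.
    return sorted(paths)


# Hypothetical toy splice graph: exon2 is a cassette exon that may be skipped.
toy_graph = {
    "S": ["exon1"],
    "exon1": ["exon2", "exon3"],
    "exon2": ["exon3"],
    "exon3": ["T"],
}

candidate_transcripts = enumerate_full_paths(toy_graph, "S", "T")
for transcript in candidate_transcripts:
    print(" -> ".join(transcript))
```

In this toy graph the two full paths correspond to the exon-inclusion and exon-skipping isoforms. The number of such paths can grow exponentially with the number of alternative splicing events, which is one reason to model abundances on the graph rather than on an explicit transcript list.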