Department of Statistics, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, CA 94305, USA.
Genome Biol. 2010;11(5):R50. doi: 10.1186/gb-2010-11-5-r50. Epub 2010 May 11.
After mapping, RNA-Seq data can be summarized as a sequence of read counts, which are commonly modeled as Poisson variables with a constant rate along each transcript. This constant-rate model fits the data poorly. We suggest using variable rates for different positions and propose two models that predict these rates from the local sequence. These models explain more than 50% of the variation and can lead to improved estimates of gene and isoform expression for both Illumina and Applied Biosystems data.
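The contrast between a constant rate and position-specific rates can be illustrated with a minimal sketch. It assumes a log-linear form in which the log rate at each position is the log expression level plus additive effects of the nucleotides in a small surrounding window. This form, the function names (`local_window`, `log_rate`, `poisson_loglik`), the window size, and the effect values are hypothetical choices for illustration; the abstract does not specify the two proposed models in this detail.

```python
import math


def local_window(transcript, pos, up=2, down=2):
    """Nucleotides from pos-up to pos+down, padded with 'N' past the ends."""
    return [transcript[i] if 0 <= i < len(transcript) else "N"
            for i in range(pos - up, pos + down + 1)]


def log_rate(transcript, pos, log_expression, effects, up=2, down=2):
    """Assumed form: log(rate) = log(expression) + sum of window effects.

    `effects` maps (window offset, nucleotide) to an additive log-scale
    effect; missing keys count as 0.
    """
    window = local_window(transcript, pos, up, down)
    return log_expression + sum(
        effects.get((offset, nt), 0.0) for offset, nt in enumerate(window))


def poisson_loglik(counts, rates):
    """Log-likelihood of independent Poisson counts with the given rates."""
    return sum(c * math.log(r) - r - math.lgamma(c + 1)
               for c, r in zip(counts, rates))


# Toy data (made up): read counts at each position of one short transcript.
transcript = "ACGTACGT"
counts = [2, 2, 6, 2, 2, 2, 6, 2]

# Constant-rate model: one rate per transcript, estimated by the mean count.
constant_rates = [sum(counts) / len(counts)] * len(counts)

# Variable-rate model: hypothetical effect tripling the rate when the
# position itself (window offset 2 when up=2) is a G.
effects = {(2, "G"): math.log(3.0)}
variable_rates = [math.exp(log_rate(transcript, p, math.log(2.0), effects))
                  for p in range(len(transcript))]

ll_constant = poisson_loglik(counts, constant_rates)
ll_variable = poisson_loglik(counts, variable_rates)
print(ll_constant, ll_variable)
```

The toy counts were chosen to match the variable rates, so the higher log-likelihood of the variable-rate model here shows only the mechanism and says nothing about real data.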