Department of Computer Sciences, University of Wisconsin, Madison, WI 53706, USA.
Bioinformatics. 2010 Feb 15;26(4):493-500. doi: 10.1093/bioinformatics/btp692. Epub 2009 Dec 18.
RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than the transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.
We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. The improved accuracy results from two factors: handling read mapping uncertainty with a statistical model, and estimating gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method can model non-uniform read distributions. Simulations with our method indicate that, when sequencing throughput is fixed, a read length of 20-25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data.
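To make the idea of allocating ambiguous reads statistically rather than heuristically more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a heavily simplified model in which each read lists the isoforms it is compatible with, all compatible positions are equally likely, and transcript lengths, sequencing errors, and non-uniform read distributions are ignored. Under those assumptions, an expectation-maximization loop estimates the fraction of reads originating from each isoform, and gene-level values are obtained by summing over a gene's isoforms. The function names `em_isoform_fractions` and `gene_level_fractions` and the toy data are hypothetical.

```python
from collections import defaultdict


def em_isoform_fractions(read_candidates, n_isoforms, n_iter=200, tol=1e-9):
    """Estimate the fraction of reads derived from each isoform.

    read_candidates: one non-empty list of isoform indices per read,
    giving the isoforms that read maps to (simplifying assumption:
    every listed isoform explains the read equally well).
    """
    n_reads = len(read_candidates)
    theta = [1.0 / n_isoforms] * n_isoforms  # uniform starting point
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: split each read across its candidate isoforms in
        # proportion to the current abundance estimates.
        for candidates in read_candidates:
            total = sum(theta[i] for i in candidates)
            for i in candidates:
                expected[i] += theta[i] / total
        # M-step: renormalize expected read counts into fractions.
        new_theta = [c / n_reads for c in expected]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


def gene_level_fractions(theta, isoform_to_gene):
    """Sum isoform-level estimates into gene-level estimates."""
    totals = defaultdict(float)
    for isoform_index, gene in enumerate(isoform_to_gene):
        totals[gene] += theta[isoform_index]
    return dict(totals)


# Toy data (hypothetical): isoforms 0 and 1 belong to geneA, isoform 2 to geneB.
toy_reads = [[0], [0], [0, 1], [0, 1], [1, 2], [2]]
toy_isoform_to_gene = ["geneA", "geneA", "geneB"]
toy_theta = em_isoform_fractions(toy_reads, n_isoforms=3)
toy_genes = gene_level_fractions(toy_theta, toy_isoform_to_gene)
```

In this sketch a read shared by two isoforms is never discarded. Its weight is divided according to the current estimates, which are in turn informed by the uniquely mapping reads. A fuller model along the lines described in the abstract would also account for transcript length and read position.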