Paşaniuc Bogdan, Zaitlen Noah, Halperin Eran
Departments of Epidemiology and Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA.
J Comput Biol. 2011 Mar;18(3):459-68. doi: 10.1089/cmb.2010.0259.
Abstract
Next-generation high-throughput sequencing (NGS) is poised to replace array-based technologies as the experiment of choice for measuring RNA expression levels. Several groups have demonstrated the power of this new approach (RNA-seq), making significant and novel contributions and simultaneously proposing methodologies for the analysis of RNA-seq data. In a typical experiment, millions of short sequences (reads) are sampled from RNA extracts and mapped back to a reference genome. The number of reads mapping to each gene is used as a proxy for its corresponding RNA concentration. A significant challenge in analyzing RNA expression of homologous genes is the large fraction of reads that map to multiple locations in the reference genome. Currently, these reads are either dropped from the analysis, or a naive algorithm is used to estimate their underlying distribution. In this work, we present a rigorous alternative for handling the reads generated in an RNA-seq experiment within a probabilistic model for RNA-seq data; we develop maximum likelihood-based methods for estimating the model parameters. In contrast to previous methods, our model takes into account the fact that the DNA of the sequenced individual is not a perfect copy of the reference sequence. We show with both simulated and real RNA-seq data that our new method improves the accuracy and power of RNA-seq experiments.
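To make the idea of maximum likelihood estimation with multiply mapped reads concrete, the following is a minimal sketch of a generic expectation-maximization (EM) procedure. It is not the model described in this paper; it is a textbook-style illustration under stated assumptions. The function name `em_multiread_abundance`, the input layout (one dictionary per read, mapping each candidate gene to an assumed probability of observing that read from that gene), and the toy data are all hypothetical. The per-alignment probabilities stand in for any error or variation model, and gene length normalization is ignored for brevity.

```python
def em_multiread_abundance(reads, genes, n_iter=200, tol=1e-10):
    """Estimate the fraction of reads originating from each gene.

    reads: list of dicts, one per read, mapping gene -> P(read | gene).
           A uniquely mapped read has one entry; a multiread has several.
    genes: list of gene identifiers.
    Returns a dict gene -> estimated read fraction (sums to 1).
    """
    # Start from a uniform guess over genes.
    theta = {g: 1.0 / len(genes) for g in genes}

    for _ in range(n_iter):
        expected = {g: 0.0 for g in genes}

        # E-step: split each read fractionally among its candidate genes,
        # in proportion to (current abundance) x (alignment probability).
        for alignments in reads:
            weights = {g: theta[g] * p for g, p in alignments.items()}
            total = sum(weights.values())
            if total == 0.0:
                continue  # read cannot be explained by any candidate gene
            for g, w in weights.items():
                expected[g] += w / total

        assigned = sum(expected.values())
        if assigned == 0.0:
            break  # nothing to update from

        # M-step: new abundances are the normalized expected counts.
        new_theta = {g: expected[g] / assigned for g in genes}

        delta = max(abs(new_theta[g] - theta[g]) for g in genes)
        theta = new_theta
        if delta < tol:
            break

    return theta


# Toy data (hypothetical): 30 reads unique to gene A, 10 unique to gene B,
# and 20 reads that align equally well to both.
toy_reads = (
    [{"A": 1.0}] * 30
    + [{"B": 1.0}] * 10
    + [{"A": 1.0, "B": 1.0}] * 20
)
theta = em_multiread_abundance(toy_reads, ["A", "B"])
```

In this toy case the multireads end up split in the same 3:1 ratio as the unique reads, so the estimate for gene A converges to 0.75. Dropping the multireads would give the same ratio here but discard a third of the data, while splitting them evenly would bias both genes toward 0.5.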