Department of Statistics, University of California, Berkeley, CA 94720, USA.
Proc Natl Acad Sci U S A. 2011 Dec 13;108(50):19867-72. doi: 10.1073/pnas.1113972108. Epub 2011 Dec 1.
Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to use RNA-Seq data to assemble full-length mRNA isoforms de novo and to estimate isoform abundance. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has greater power to detect novel isoforms. Another advantage of SLIDE is its flexibility in incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE subsequently uses the same linear model to estimate the abundance of the discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.
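The sparse linear model described above can be illustrated with a toy example. The following is a minimal sketch, not the SLIDE implementation: it assumes that each exon is one read bin, that reads are sampled in proportion to exon length, and that the Lasso penalty on each isoform is weighted by the inverse of its exon count. None of these details are given in the abstract, and the function names `design_matrix` and `weighted_nonneg_lasso` are hypothetical. The toy gene has three exons and four candidate isoforms, so the unpenalized model is unidentifiable (four unknowns, three bins).

```python
def design_matrix(isoforms, exon_lengths):
    """F[j][k]: probability that a read from isoform k falls in exon j.

    Assumption for this sketch: reads are sampled uniformly along an
    isoform, so the probability is exon length / isoform length.
    """
    n_exons = len(exon_lengths)
    F = [[0.0] * len(isoforms) for _ in range(n_exons)]
    for k, exons in enumerate(isoforms):
        total = sum(exon_lengths[j] for j in exons)
        for j in exons:
            F[j][k] = exon_lengths[j] / total
    return F


def weighted_nonneg_lasso(F, y, weights, lam, sweeps=5000):
    """Minimize 0.5*||y - F p||^2 + lam * sum_k weights[k] * p[k], p >= 0.

    Plain cyclic coordinate descent; a stand-in for SLIDE's modified
    Lasso, whose exact form is not stated in the abstract.
    """
    n_bins, n_iso = len(F), len(F[0])
    p = [0.0] * n_iso
    resid = list(y)  # residual y - F p, kept up to date incrementally
    col_sq = [sum(F[j][k] ** 2 for j in range(n_bins)) for k in range(n_iso)]
    for _ in range(sweeps):
        for k in range(n_iso):
            if col_sq[k] == 0.0:
                continue
            # Correlation of column k with the residual that excludes p[k].
            rho = sum(F[j][k] * resid[j] for j in range(n_bins)) + col_sq[k] * p[k]
            new_pk = max(0.0, (rho - lam * weights[k]) / col_sq[k])
            delta = new_pk - p[k]
            if delta != 0.0:
                for j in range(n_bins):
                    resid[j] -= F[j][k] * delta
                p[k] = new_pk
    return p


# Toy gene: three exons of equal length, four candidate isoforms (0-based exons).
exon_lengths = [100, 100, 100]
isoforms = [(0, 1, 2), (0, 2), (0, 1), (1, 2)]
F = design_matrix(isoforms, exon_lengths)

# Simulated noise-free bin proportions from a sample that contains only
# the first two isoforms, at proportions 0.6 and 0.4.
truth = [0.6, 0.4, 0.0, 0.0]
y = [sum(F[j][k] * truth[k] for k in range(len(isoforms))) for j in range(len(F))]

# Illustrative penalty weights (an assumption): lighter penalty on
# isoforms with more exons.
weights = [1.0 / len(exons) for exons in isoforms]
estimates = weighted_nonneg_lasso(F, y, weights, lam=0.01)

for exons, p_k in zip(isoforms, estimates):
    print(exons, round(p_k, 4))
```

In this toy setting every column of the design matrix sums to one, so an unweighted L1 penalty cannot distinguish between the many exact solutions. The per-isoform weights break that tie, which shows why some modification of the standard Lasso is needed when the model is unidentifiable. The nonzero estimates are shrunk slightly toward zero by the penalty, so a separate refit on the selected isoforms would be the natural way to estimate abundance.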