Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation.

Affiliation

Department of Statistics, University of California, Berkeley, CA 94720, USA.

Publication information

Proc Natl Acad Sci U S A. 2011 Dec 13;108(50):19867-72. doi: 10.1073/pnas.1113972108. Epub 2011 Dec 1.

Abstract

Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to use RNA-Seq data to assemble full-length mRNA isoforms de novo and to estimate isoform abundance. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power to detect novel isoforms. Another advantage of SLIDE is its flexibility in incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE subsequently uses the same linear model to estimate the abundance of the discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.
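
The idea of a linear model with a sparsity penalty can be illustrated with a toy sketch. It is not the SLIDE implementation: the abstract does not specify SLIDE's read bins or its modified Lasso penalty, so this sketch assumes one bin per exon, uniform read sampling along each isoform, and a plain non-negative Lasso solved by coordinate descent. All function names, exon lengths, and the penalty value `lam` are hypothetical.

```python
from itertools import combinations

def candidate_isoforms(n_exons):
    """Enumerate every non-empty exon subset as a candidate isoform."""
    return [c for r in range(1, n_exons + 1)
            for c in combinations(range(n_exons), r)]

def design_matrix(exon_lengths, isoforms):
    """F[j][k]: probability that a read from isoform k falls in exon j,
    assuming reads are sampled uniformly along the isoform (an assumption)."""
    F = [[0.0] * len(isoforms) for _ in exon_lengths]
    for k, iso in enumerate(isoforms):
        total = sum(exon_lengths[j] for j in iso)
        for j in iso:
            F[j][k] = exon_lengths[j] / total
    return F

def objective(F, b, p, lam):
    """0.5 * ||b - F p||^2 + lam * sum(p), valid for p >= 0."""
    resid = [b[j] - sum(F[j][k] * p[k] for k in range(len(p)))
             for j in range(len(b))]
    return 0.5 * sum(r * r for r in resid) + lam * sum(p)

def nonneg_lasso(F, b, lam, n_sweeps=500):
    """Coordinate descent for the non-negative Lasso, starting from p = 0."""
    n_bins, n_iso = len(F), len(F[0])
    p = [0.0] * n_iso
    resid = list(b)  # equals b - F p while p is all zeros
    for _ in range(n_sweeps):
        for k in range(n_iso):
            col = [F[j][k] for j in range(n_bins)]
            z = sum(c * c for c in col)
            # Correlation of column k with the residual that excludes p[k].
            rho = sum(col[j] * (resid[j] + col[j] * p[k]) for j in range(n_bins))
            new = max(0.0, (rho - lam) / z)  # soft-threshold, clipped at zero
            delta = new - p[k]
            for j in range(n_bins):
                resid[j] -= col[j] * delta
            p[k] = new
    return p

# Toy gene with three exons (lengths are made up for illustration).
exon_lengths = [200, 100, 300]
isoforms = candidate_isoforms(len(exon_lengths))
F = design_matrix(exon_lengths, isoforms)

# Simulated bin proportions from two "true" isoforms: (0, 1, 2) and (0, 2).
p_true = [0.0] * len(isoforms)
p_true[isoforms.index((0, 1, 2))] = 0.6
p_true[isoforms.index((0, 2))] = 0.4
b = [sum(F[j][k] * p_true[k] for k in range(len(isoforms)))
     for j in range(len(exon_lengths))]

lam = 0.01
p_hat = nonneg_lasso(F, b, lam)
for iso, weight in zip(isoforms, p_hat):
    print(iso, round(weight, 4))
```

With three bins and seven candidate isoforms the system is underdetermined, which is a small-scale version of the identifiability issue the abstract mentions. The penalty selects one sparse explanation of the bin proportions, and that explanation is not guaranteed to match the simulated isoforms.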
