Joint estimation of isoform expression and isoform-specific read distribution using multisample RNA-Seq data.

Author information

Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden; Department of Molecular and Translational Medicine, University of Brescia, Italy; and Department of Mathematics and Statistics, La Trobe University, Australia.

Publication information

Bioinformatics. 2014 Feb 15;30(4):506-13. doi: 10.1093/bioinformatics/btt704. Epub 2013 Dec 3.

Abstract

MOTIVATION

RNA-sequencing technologies provide a powerful tool for expression analysis at the gene and isoform levels, but accurate estimation of isoform abundance is still a challenge. The standard assumption of uniform read intensity yields biased estimates when the read intensity is in fact non-uniform. The problem is that, without strong assumptions, the read intensity pattern is not identifiable from data observed in a single sample.
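The bias described above can be illustrated with a small noise-free toy example. Everything in it is a hypothetical assumption made for illustration and is not taken from the paper: a gene with three equal-length exons, two isoforms (the second skips exon 2), made-up read-intensity weights, and ordinary least squares standing in for the estimation method.

```python
import numpy as np

# Hypothetical gene: rows are exons, columns are isoforms.
# Isoform 1 contains exons 1-3; isoform 2 skips exon 2.
X = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

# Hypothetical isoform-specific read intensities (rows: exons, columns: isoforms).
# Isoform 1 is non-uniform; isoform 2 is uniform on the exons it contains.
W = np.array([[1.5, 1.0],
              [0.5, 0.0],
              [1.0, 1.0]])

true_theta = np.array([100.0, 100.0])   # true isoform expression levels

# Expected (noise-free) read counts per exon under the non-uniform model.
expected_counts = (X * W) @ true_theta

# Estimate that wrongly assumes uniform read intensity (design matrix X only).
theta_uniform, *_ = np.linalg.lstsq(X, expected_counts, rcond=None)

# Estimate that uses the true intensities (design matrix X * W).
theta_weighted, *_ = np.linalg.lstsq(X * W, expected_counts, rcond=None)

if __name__ == "__main__":
    print("uniform assumption:", theta_uniform)    # approximately [50, 175]
    print("true intensities:  ", theta_weighted)   # approximately [100, 100]
```

Even without sampling noise, the uniform model assigns the wrong share of the reads to each isoform. In practice the weights in `W` are unknown, which is why the abstract stresses that they cannot be identified from a single sample.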

RESULTS

We develop a joint statistical model that estimates gene isoform expression while accounting for non-uniform, isoform-specific read distributions. The main challenge is in dealing with the large number of isoform-specific read distributions, potentially as many as there are splice variants in the genome. We impose statistical regularization via a smoothing penalty to control the estimation. Also, for identifiability reasons, the method uses information across samples from the same region. We develop a fast and robust computational procedure based on the iterated-weighted least-squares algorithm, and apply it to simulated data and two real RNA-Seq datasets with reverse transcription-polymerase chain reaction validation. Empirical tests show that our model outperforms existing methods in the precision of isoform-level estimates.
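To make the ingredients above concrete (a smoothing penalty, pooling across samples, and iteratively weighted least squares), here is a minimal sketch. It is not the Sequgio implementation, and it rests on simplifying assumptions that do not come from the paper: a single isoform, Poisson counts with mean theta_s * w_j for sample s and bin j, a second-difference roughness penalty, and the constraint mean(w) = 1 for identifiability. The function names and the default value of `lam` are hypothetical.

```python
import numpy as np

def second_difference_matrix(n_bins):
    """(n_bins - 2) x n_bins operator whose rows are [1, -2, 1] stencils."""
    return np.diff(np.eye(n_bins), n=2, axis=0)

def fit_smooth_read_intensity(counts, lam=10.0, n_iter=50, eps=1e-8):
    """Alternate between expression and read-intensity updates.

    counts : (n_samples, n_bins) read counts for one isoform (assumed layout).
    Returns (theta, w): per-sample expression and a shared intensity curve
    normalised to mean 1.
    """
    counts = np.asarray(counts, dtype=float)
    n_samples, n_bins = counts.shape
    D = second_difference_matrix(n_bins)
    penalty = lam * (D.T @ D)            # smoothing penalty lam * ||D w||^2
    w = np.ones(n_bins)                  # start from the uniform assumption

    for _ in range(n_iter):
        # Expression step: Poisson maximum likelihood for theta given w.
        theta = counts.sum(axis=1) / w.sum()
        # Weighted least-squares step for w with weights 1 / mu (Poisson
        # variance), pooling all samples, plus the smoothing penalty.
        mu = np.maximum(np.outer(theta, w), eps)
        lhs = np.diag((theta[:, None] ** 2 / mu).sum(axis=0)) + penalty
        rhs = (theta[:, None] * counts / mu).sum(axis=0)
        w = np.maximum(np.linalg.solve(lhs, rhs), eps)
        w /= w.mean()                    # identifiability: mean(w) = 1

    theta = counts.sum(axis=1) / w.sum()
    return theta, w

if __name__ == "__main__":
    # Simulated data with made-up parameters, for illustration only.
    rng = np.random.default_rng(0)
    sim_w = np.linspace(0.5, 1.5, 20)
    sim_theta = np.array([50.0, 100.0, 200.0, 400.0])
    sim_counts = rng.poisson(np.outer(sim_theta, sim_w))
    theta_hat, w_hat = fit_smooth_read_intensity(sim_counts)
    print(theta_hat)
    print(w_hat)
```

Because the intensity curve `w` is shared by all samples, every sample contributes to its estimate, which is the role that multisample data plays in the abstract. Extending the sketch to several overlapping isoforms would require a design matrix linking bins to isoforms, which is omitted here.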

AVAILABILITY AND IMPLEMENTATION

We have implemented our method in an R package called Sequgio as a pipeline for fast processing of RNA-Seq data.
