Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA and Cardiovascular Institute, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA.
Nucleic Acids Res. 2014 Feb;42(3):e20. doi: 10.1093/nar/gkt1304. Epub 2013 Dec 20.
Correctly estimating isoform-specific gene expression is important for understanding complicated biological mechanisms and for mapping disease susceptibility genes. However, estimating isoform-specific gene expression is challenging because various biases present in RNA-Seq (RNA sequencing) data complicate the analysis and, if not appropriately corrected, can affect isoform expression estimation and downstream analysis. In this article, we present PennSeq, a statistical method that allows each isoform to have its own non-uniform read distribution. Instead of making parametric assumptions, we give adequate weight to the underlying data by using a non-parametric approach. Our rationale is that regardless of what factors lead to non-uniformity, whether hexamer priming bias, local sequence bias, positional bias, RNA degradation, mapping bias or other unknown reasons, the probability that a fragment is sampled from a particular region will be reflected in the aligned data. This empirical approach thus maximally reflects the true underlying non-uniform read distribution. We evaluate the performance of PennSeq using both simulated data with known ground truth and two real Illumina RNA-Seq data sets, including one with quantitative real-time polymerase chain reaction measurements. Our results indicate superior performance of PennSeq over existing methods, particularly for isoforms demonstrating severe non-uniformity. PennSeq is freely available for download at http://sourceforge.net/projects/pennseq.
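To make the idea of an isoform-specific, empirically estimated read distribution concrete, the following is a minimal sketch and not the PennSeq implementation. It assumes a toy setting in which each read is summarized by the position bin it occupies on every isoform it is compatible with, and it alternates between assigning reads to isoforms and re-estimating each isoform's read distribution as a smoothed histogram of its assigned reads. The function name `em_isoform_fractions`, the binning of positions, and the pseudocount are illustrative assumptions; the abstract does not specify these details.

```python
def em_isoform_fractions(reads, n_isoforms, n_bins, n_iter=50, pseudo=1.0):
    """Toy EM with a non-parametric (histogram) read distribution per isoform.

    reads: non-empty list of dicts mapping isoform index -> position bin of
           that read on that isoform (hypothetical input format).
    Returns (theta, q): theta[i] is the estimated fraction of reads from
    isoform i; q[i][b] is the estimated probability that a read from
    isoform i falls in bin b.
    """
    if not reads:
        raise ValueError("reads must be non-empty")
    # Start from equal fractions and uniform read distributions.
    theta = [1.0 / n_isoforms] * n_isoforms
    q = [[1.0 / n_bins] * n_bins for _ in range(n_isoforms)]
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # The pseudocount keeps every bin probability strictly positive.
        hist = [[pseudo] * n_bins for _ in range(n_isoforms)]
        for read in reads:
            # E-step: responsibility of each compatible isoform for this read.
            w = {i: theta[i] * q[i][b] for i, b in read.items()}
            total = sum(w.values())
            for i, b in read.items():
                r = w[i] / total
                counts[i] += r
                hist[i][b] += r
        # M-step: update read fractions and the empirical distributions.
        n = sum(counts)
        theta = [c / n for c in counts]
        q = [[h / sum(row) for h in row] for row in hist]
    return theta, q


# Toy usage: two isoforms, four position bins, some reads shared by both.
toy_reads = [{0: 0}] * 30 + [{1: 3}] * 10 + [{0: 1, 1: 1}] * 20
theta, q = em_isoform_fractions(toy_reads, n_isoforms=2, n_bins=4)
print("read fractions:", theta)
```

Note that `theta` here is a fraction of reads rather than a length-normalized abundance, and that estimating the read distribution freely from the same data can weaken identifiability for reads shared between isoforms; a practical method needs additional care on both points.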