Biostatistics Resource, Keck Laboratory, Yale University, 300 George Street, New Haven, Connecticut, 06510, USA.
BMC Bioinformatics. 2011 Jul 19;12:290. doi: 10.1186/1471-2105-12-290.
High-throughput sequencing technology provides unprecedented opportunities to study transcriptome dynamics. Compared to microarray-based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low background, and the ability to identify novel transcripts. Moreover, for genes with multiple isoforms, the expression of each isoform may be estimated from RNA-Seq data. Despite these advantages, recent work has revealed that base-level read counts from RNA-Seq data may not be randomly distributed and can be affected by local nucleotide composition. It was not clear, however, how this base-level read count bias affects gene-level expression estimates.
In this paper, using five published RNA-Seq data sets from different biological sources and with different data preprocessing schemes, we showed that commonly used estimates of gene expression levels from RNA-Seq data, such as reads per kilobase of gene length per million reads (RPKM), are biased with respect to gene length, GC content and dinucleotide frequencies. We examined the biases directly at the gene level, and proposed a simple approach based on a generalized additive model to correct the different sources of bias simultaneously. Compared to previously proposed base-level correction methods, our method reduces bias in gene-level expression estimates more effectively.
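As an illustration of the quantities named above, the following minimal sketch computes RPKM together with the gene-level covariates the paragraph mentions (gene length enters through RPKM itself; GC content and dinucleotide frequencies are computed from the sequence). The function names, the toy sequence, and the toy counts are assumptions made for this example and are not taken from the paper; the sketch does not implement the generalized additive model correction itself.

```python
from collections import Counter


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    # reads / (length in kb) / (library size in millions)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide in the sequence."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}


if __name__ == "__main__":
    # Hypothetical gene: all values below are made up for illustration.
    toy_sequence = "ATGCGCATAT"
    print(rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000))
    print(gc_content(toy_sequence))
    print(dinucleotide_frequencies(toy_sequence))
```

In a gene-level analysis of the kind described, covariates such as these would be computed for every gene and then related to the expression estimate; how the paper's model combines them is not specified in this abstract.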
Our method identifies and corrects different sources of bias in gene-level expression measures from RNA-Seq data, and thereby provides more accurate estimates of gene expression levels. This method should prove useful in meta-analyses of gene expression levels across different platforms or experimental protocols.