Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
Bioinformatics. 2011 Mar 1;27(5):662-9. doi: 10.1093/bioinformatics/btr005. Epub 2011 Jan 19.
Next-generation sequencing technologies are being rapidly applied to quantifying transcripts (RNA-seq). However, due to the unique properties of RNA-seq data, the differential expression of longer transcripts is more likely to be identified than that of shorter transcripts with the same effect size. This bias complicates downstream gene set analysis (GSA), because GSA methods previously developed for microarray data assume that genes with the same effect size have equal probability (power) of being identified as significantly differentially expressed. Since transcript length is not related to gene expression, adjusting for this length dependency in GSA is necessary.
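One way to see where the length dependence comes from is a toy two-condition Poisson model. The setup below is an illustrative assumption, not taken from the article: a gene of length L, per-base expression rates λ1 and λ2, and equal sequencing depth N in both conditions. Under these assumptions, a simple z-type statistic grows with the square root of L even though the per-base effect is unchanged.

```latex
% Assumed toy model (not from the article): counts X_1, X_2 for one gene
% of length L at equal sequencing depth N in two conditions.
\begin{align*}
X_k &\sim \mathrm{Poisson}(N L \lambda_k), \qquad k = 1, 2, \\
z &= \frac{X_1 - X_2}{\sqrt{X_1 + X_2}}
   \approx \frac{N L (\lambda_1 - \lambda_2)}{\sqrt{N L (\lambda_1 + \lambda_2)}}
   = \sqrt{N L}\,\frac{\lambda_1 - \lambda_2}{\sqrt{\lambda_1 + \lambda_2}}.
\end{align*}
```

In this sketch, two genes with identical λ1 and λ2 but different lengths yield statistics that differ by a factor of the square root of their length ratio, so the longer gene is more likely to cross a fixed significance threshold.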
In this article, we proposed two approaches to transcript-length adjustment for analyses based on Poisson models: (i) At the individual gene level, we adjusted each gene's test statistic using the square root of transcript length and then tested each gene set using the Wilcoxon rank-sum test. (ii) At the gene set level, we adjusted the null distribution of Fisher's exact test by weighting the identification probability of each gene by the square root of its transcript length. We evaluated these two approaches using simulations and a real dataset, and showed that they can effectively reduce transcript-length bias. The top-ranked GO terms obtained with the proposed adjustments show more overlap with the microarray results.
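The two adjustments can be sketched roughly as follows. This is a minimal Python illustration, not the authors' R scripts, and it rests on several assumptions. Approach (i) is taken to mean dividing each statistic by the square root of length. The rank-sum test uses a normal approximation without tie correction in the variance. Approach (ii) is approximated by resampling rather than by an exact weighted distribution. All function and parameter names (`length_adjusted_stats`, `rank_sum_test`, `weighted_null_pvalue`, `n_draws`) are hypothetical.

```python
import math
import random


def length_adjusted_stats(stats, lengths):
    """Approach (i), assumed form: divide each statistic by sqrt(length)."""
    return [s / math.sqrt(length) for s, length in zip(stats, lengths)]


def rank_sum_test(in_set, out_set):
    """Wilcoxon rank-sum test, normal approximation, two-sided p-value.

    Tied values receive average ranks; the variance is not tie-corrected.
    """
    values = list(in_set) + list(out_set)
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    n1, n2 = len(in_set), len(out_set)
    rank_sum = sum(ranks[:n1])  # rank sum of the genes inside the set
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - mean) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))


def weighted_null_pvalue(lengths, in_set, n_de, observed, n_draws=2000, seed=0):
    """Approach (ii), approximated by resampling (an assumption).

    Each draw picks n_de genes without replacement, with selection weight
    sqrt(length), and counts how many fall in the gene set. The p-value is
    the share of draws with at least `observed` set members.
    """
    rng = random.Random(seed)
    weights = [math.sqrt(length) for length in lengths]
    hits = 0
    for _ in range(n_draws):
        # The n_de smallest exponential keys with rate = weight give a
        # weighted sample without replacement.
        keys = sorted((rng.expovariate(w), g) for g, w in enumerate(weights))
        overlap = sum(1 for _, g in keys[:n_de] if in_set[g])
        if overlap >= observed:
            hits += 1
    return (hits + 1) / (n_draws + 1)
```

Here `in_set` for `weighted_null_pvalue` is a list of booleans, one per gene, marking membership in the gene set. For a set made up mostly of long genes, the weighted null expects more set members among the significant genes than an unweighted null does, so the same observed overlap yields a larger p-value.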
R scripts are available at http://www.soph.uab.edu/Statgenetics/People/XCui/r-codes/.