Lun Aaron T L, Smyth Gordon K.
Stat Appl Genet Mol Biol. 2017 Apr 25;16(2):83-93. doi: 10.1515/sagmb-2017-0010.
RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted, for each gene, to the read counts across independent replicate samples. This article shows that the standard formula overstates the residual degrees of freedom (d.f.) of a linear model when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero, and also in more complex models such as those involving paired comparisons. The overstated d.f. leads to underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results also apply to RNA-seq analyses that fit linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.
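To make the d.f. issue concrete, the following is a minimal sketch for the simplest case only: a one-way layout with one coefficient per treatment group. It assumes that a group whose counts are all zero has fitted values of exactly zero and therefore contributes no residual d.f. The function name `residual_df_one_way` is hypothetical; this is neither the edgeR implementation nor the article's general formula, which also covers designs such as paired comparisons.

```python
from collections import defaultdict


def residual_df_one_way(counts, groups):
    """Return (standard_df, reduced_df) for one gene in a one-way layout.

    Assumption (illustrative only): each group has its own coefficient, so a
    group with all-zero counts has fitted values of exactly zero and carries
    no information about the variance.
    """
    by_group = defaultdict(list)
    for count, group in zip(counts, groups):
        by_group[group].append(count)

    # Standard formula: number of samples minus number of coefficients.
    standard_df = len(counts) - len(by_group)

    # Reduced d.f.: only groups with at least one non-zero count contribute,
    # each giving (group size - 1) residual d.f.
    reduced_df = sum(
        len(values) - 1
        for values in by_group.values()
        if any(c > 0 for c in values)
    )
    return standard_df, reduced_df


# Hypothetical gene: group A is entirely zero, group B is expressed.
counts = [0, 0, 0, 5, 8, 3]
groups = ["A", "A", "A", "B", "B", "B"]
standard_df, reduced_df = residual_df_one_way(counts, groups)
print(standard_df, reduced_df)  # prints "4 2"
```

In this made-up example the standard formula reports 4 residual d.f., but only the 2 d.f. from group B reflect any observed variability, which is the kind of overstatement the abstract describes.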