Department of Statistics, Stanford University, Stanford, CA 94305, USA.
Biostatistics. 2012 Jul;13(3):523-38. doi: 10.1093/biostatistics/kxr031. Epub 2011 Oct 14.
We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
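The log-linear setup described above can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes sequencing depth is estimated from plain total counts per sample (a common simplification, not the new normalization the abstract refers to), fits the null model in which each expected count is depth times the gene's total, and scores each gene with a Pearson-type statistic on class totals. The function names `estimate_depth` and `class_scores` are hypothetical, and FDR estimation is left out.

```python
import numpy as np


def estimate_depth(counts):
    """Relative sequencing depth per sample (columns), scaled to sum to 1.

    Assumption: depth is proportional to the total read count of a sample.
    """
    totals = counts.sum(axis=0).astype(float)
    return totals / totals.sum()


def class_scores(counts, labels):
    """Per-gene score comparing observed and expected class totals.

    counts: array of shape (genes, samples) holding read counts.
    labels: class label for each sample (two or more classes).
    Under the null log-linear model, the expected count for gene i in
    sample j is depth[j] * (total count of gene i).
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    depth = estimate_depth(counts)
    gene_totals = counts.sum(axis=1, keepdims=True)
    expected = gene_totals * depth  # shape (genes, samples)

    scores = np.zeros(counts.shape[0])
    for k in np.unique(labels):
        in_class = labels == k
        obs_k = counts[:, in_class].sum(axis=1)
        exp_k = expected[:, in_class].sum(axis=1)
        # Skip genes with zero expected count to avoid dividing by zero.
        term = np.divide((obs_k - exp_k) ** 2, exp_k,
                         out=np.zeros_like(exp_k), where=exp_k > 0)
        scores += term
    return scores


# Two genes, two samples, one sample per class.
toy = np.array([[10, 10],
                [30, 10]])
print(class_scores(toy, [1, 2]))
```

Genes whose counts scale exactly with depth receive a score of zero, so differences in total read count between samples do not by themselves produce a signal. In practice the score would be compared against a reference distribution, for example one obtained by permuting the labels.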