Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia.
Nucleic Acids Res. 2012 May;40(10):4288-97. doi: 10.1093/nar/gks042. Epub 2012 Jan 28.
A flexible statistical framework is developed for the analysis of read counts from RNA-Seq gene expression studies. It provides the ability to analyse complex experiments involving multiple treatment conditions and blocking variables while still taking full account of biological variation. Biological variation between RNA samples is estimated separately from the technical variation associated with sequencing technologies. Novel empirical Bayes methods allow each gene to have its own specific variability, even when there are relatively few biological replicates from which to estimate such variability. The pipeline is implemented in the edgeR package of the Bioconductor project. A case study analysis of carcinoma data demonstrates the ability of generalized linear model (GLM) methods to detect differential expression in a paired design, and even to detect tumour-specific expression changes. The case study demonstrates the need to allow for gene-specific variability, rather than assuming a common dispersion across genes or a fixed relationship between abundance and variability. Genewise dispersions de-prioritize genes with inconsistent results and allow the main analysis to focus on changes that are consistent between biological replicates. Parallel computational approaches are developed to make non-linear model fitting faster and more reliable, making the application of GLMs to genomic data more convenient and practical. Simulations demonstrate the ability of adjusted profile likelihood estimators to return accurate estimates of biological variability in complex situations. When variation is gene-specific, empirical Bayes estimators provide an advantageous compromise between the extremes of assuming a common dispersion or separate genewise dispersions. The methods developed here can also be applied to count data arising from DNA-Seq applications, including ChIP-Seq for epigenetic marks and DNA methylation analyses.
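The separation of biological from technical variation described in the abstract can be illustrated with a small sketch. It assumes the negative binomial model commonly used for RNA-Seq read counts, in which the variance of a count with mean `mean` is `mean + dispersion * mean**2`; the abstract itself does not spell out the model, and the function names below are hypothetical and are not part of the edgeR API.

```python
import math


def nb_variance(mean: float, dispersion: float) -> float:
    """Variance of a negative binomial count (assumed model): mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


def squared_cv(mean: float, dispersion: float) -> float:
    """Squared coefficient of variation: variance / mean^2.

    Under the assumed model this equals 1/mean + dispersion, i.e. a
    technical (sequencing, Poisson-like) part plus a biological part.
    """
    return nb_variance(mean, dispersion) / mean ** 2


def biological_cv(dispersion: float) -> float:
    """Biological coefficient of variation: the square root of the dispersion."""
    return math.sqrt(dispersion)


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper's case study.
    dispersion = 0.04
    for mean in (10.0, 100.0, 10000.0):
        technical = 1.0 / mean
        total = squared_cv(mean, dispersion)
        print(f"mean={mean:>8.0f}  technical CV^2={technical:.4f}  "
              f"biological CV^2={dispersion:.4f}  total CV^2={total:.4f}")
```

Under this assumption the technical term shrinks as the expected count grows while the biological term does not, which is one way to see why gene-specific dispersion estimates matter most for well-covered genes.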