Bonafede Elisabetta, Picard Franck, Robin Stéphane, Viroli Cinzia
Department of Statistical Sciences, University of Bologna, 40126 Italy.
Laboratoire de Biométrie et Biologie Évolutive, UMR CNRS 5558 Univ. Lyon 1, F-69622 Villeurbanne, France.
Biometrics. 2016 Sep;72(3):804-14. doi: 10.1111/biom.12458. Epub 2015 Dec 18.
Next-generation sequencing technologies are now a method of choice for measuring gene expression. The data to be analyzed are read counts, commonly modeled with negative binomial distributions. A key issue in this probabilistic framework is the reliable estimation of the overdispersion parameter, which is made harder by the limited number of replicates generally available for each gene. Many strategies have been proposed to estimate this parameter, but when the goal is differential analysis they often result in procedures based on plug-in estimates, and we show here that this discrepancy between the estimation framework and the testing framework can lead to uncontrolled type-I errors. Instead, we propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Three consistent statistical tests are developed for differential expression analysis. We show through an extensive simulation study that the proposed method improves the sensitivity of detecting differentially expressed genes relative to common procedures, since it reaches the nominal value for the type-I error while keeping high discriminative power between differentially and non-differentially expressed genes. The method is finally illustrated on prostate cancer RNA-Seq data.
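To make the overdispersion issue concrete, the following is a minimal sketch, not the mixture model proposed in the paper. It assumes the common parameterization Var = mean + dispersion × mean², simulates counts as a gamma-Poisson mixture, and uses a simple per-gene method-of-moments estimator chosen here only for illustration. All function names, the mean of 20, the dispersion of 0.2, and the replicate numbers are hypothetical.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Draw one Poisson variate (Knuth's method; adequate for moderate rates)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def nb_counts(mean, dispersion, n_replicates, rng):
    """Negative binomial counts as a gamma-Poisson mixture.

    The gamma rate has mean `mean` and variance dispersion * mean**2,
    so the counts have variance mean + dispersion * mean**2.
    """
    shape = 1.0 / dispersion
    scale = mean * dispersion
    return [poisson(rng.gammavariate(shape, scale), rng) for _ in range(n_replicates)]


def moment_dispersion(counts):
    """Illustrative per-gene estimate (s^2 - m) / m^2, floored at zero."""
    m = statistics.mean(counts)
    if m == 0:
        return 0.0
    s2 = statistics.variance(counts)
    return max(0.0, (s2 - m) / m**2)


rng = random.Random(0)
n_genes, mean, true_dispersion = 1000, 20.0, 0.2  # assumed values

# Same true dispersion for every gene; only the number of replicates changes.
few = [moment_dispersion(nb_counts(mean, true_dispersion, 3, rng)) for _ in range(n_genes)]
many = [moment_dispersion(nb_counts(mean, true_dispersion, 30, rng)) for _ in range(n_genes)]

spread_few = statistics.pstdev(few)
spread_many = statistics.pstdev(many)
print(f"spread of estimates with 3 replicates:  {spread_few:.3f}")
print(f"spread of estimates with 30 replicates: {spread_many:.3f}")
```

With only a few replicates, the per-gene estimates scatter widely around the true value, which is the kind of instability that motivates sharing information across genes with similar variability.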