Chen Zhongxue, Liu Jianzhong, Ng Hon Keung Tony, Nadarajah Saralees, Kaufman Howard L, Yang Jack Y, Deng Youping
Biostatistics Epidemiology Research Design Core, Center for Clinical and Translational Sciences, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
BMC Syst Biol. 2011;5 Suppl 3(Suppl 3):S1. doi: 10.1186/1752-0509-5-S3-S1. Epub 2011 Dec 23.
For RNA-seq data, the aggregated counts of the short reads from the same gene are used to approximate the gene expression level. The count data can be modelled as samples from Poisson distributions with possibly different parameters. To detect genes that are differentially expressed between two conditions, statistical methods for testing the difference between two Poisson means are used. When the expression level of a gene is low, i.e., the read count is small, it is usually more difficult to detect a difference in means, so statistical methods that are more powerful at low expression levels are particularly desirable. In the statistical literature, several methods have been proposed to compare two Poisson means (rates). In this paper, we compare these methods using simulated and real RNA-seq data.
Through simulation studies and real data analysis, we find that the Wald test applied to log-transformed data is more powerful than the other methods considered. The likelihood ratio test has power similar to that of the variance-stabilizing transformation test, and both are more powerful than the conditional exact test and the Fisher exact test.
When the count data in RNA-seq can be reasonably modelled as Poisson distributed, the Wald-Log test is more powerful than the other methods compared and should be used to detect differentially expressed genes.
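As a rough illustration of two of the tests named above, the sketch below computes a Wald statistic on the log scale and a likelihood ratio statistic for a single pair of Poisson counts. It uses common textbook forms of these tests, which may differ in detail from the exact statistics evaluated in the paper. The function names and the exposure arguments `t1` and `t2` (for example, library sizes) are assumptions made for this example only.

```python
import math


def wald_log_test(x1, x2, t1=1.0, t2=1.0):
    """Two-sided Wald test on the log rate ratio of two Poisson counts.

    Textbook delta-method form: Var(log X) is approximated by 1/X.
    Hypothetical helper, not taken from the paper.
    """
    if x1 <= 0 or x2 <= 0:
        # The log transform is undefined at zero; how zero counts are
        # handled (e.g. adding a small constant) is left to the caller.
        raise ValueError("both counts must be positive")
    log_ratio = math.log(x1 / t1) - math.log(x2 / t2)
    se = math.sqrt(1.0 / x1 + 1.0 / x2)
    z = log_ratio / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def likelihood_ratio_test(x1, x2, t1=1.0, t2=1.0):
    """Likelihood ratio test of equal Poisson rates (1 degree of freedom)."""
    pooled = (x1 + x2) / (t1 + t2)  # rate estimate under the null

    def term(x, t):
        # A zero count contributes nothing (x * log x tends to 0).
        return x * math.log(x / (t * pooled)) if x > 0 else 0.0

    g = max(2.0 * (term(x1, t1) + term(x2, t2)), 0.0)
    # Chi-square (1 df) upper tail equals the two-sided normal tail at sqrt(g).
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p


if __name__ == "__main__":
    # Illustrative counts for one gene under two conditions (made up).
    print(wald_log_test(20, 10))
    print(likelihood_ratio_test(20, 10))
```

Both functions return a test statistic and an asymptotic p-value. At low counts these approximations can be rough, and that is the regime in which the abstract compares the power of the different tests.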