Manda Samuel O M, Walls Rebecca E, Gilthorpe Mark S
Biostatistics Unit, Centre for Epidemiology and Biostatistics, Leeds, UK.
BMC Bioinformatics. 2007 Apr 17;8:124. doi: 10.1186/1471-2105-8-124.
In many laboratory-based high-throughput microarray experiments, there are very few replicates of gene expression levels, so estimates of gene variances are inaccurate. Visual inspection of graphical summaries of these data usually reveals heteroscedasticity, and the standard approach to address this is a log2 transformation. It is then common to assume that gene variability is constant when these data are analysed. However, this is perhaps too stringent an assumption: more careful inspection reveals that the simple log2 transformation does not remove the heteroscedasticity. An alternative strategy is to assume independent gene-specific variances, but this is also problematic because variance estimates based on few replicates are highly unstable. More meaningful and reliable comparisons of gene expression across different conditions or tissue samples might be achieved if the test statistics were based on accurate estimates of gene variability, a crucial step in the identification of differentially expressed genes.
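The instability of variance estimates from few replicates can be illustrated with a short calculation. The sketch below assumes normally distributed log-expression values, an assumption made here for illustration and not stated in the abstract. Under that assumption the sample variance has a coefficient of variation of sqrt(2 / df), where df is the degrees of freedom. The function name `variance_estimate_cv` is hypothetical.

```python
import math


def variance_estimate_cv(degrees_of_freedom: int) -> float:
    """Coefficient of variation of a sample variance under normality.

    For normal data, df * s^2 / sigma^2 follows a chi-square distribution
    with df degrees of freedom, so sd(s^2) / E(s^2) = sqrt(2 / df).
    """
    if degrees_of_freedom < 1:
        raise ValueError("degrees_of_freedom must be at least 1")
    return math.sqrt(2.0 / degrees_of_freedom)


# One gene with four replicates: df = 4 - 1 = 3.
single_gene_cv = variance_estimate_cv(3)

# Illustrative assumption: 100 genes share one variance, four replicates each,
# giving df = 100 * (4 - 1) = 300 for the pooled estimate.
pooled_cv = variance_estimate_cv(300)

print(f"single gene: {single_gene_cv:.3f}, pooled over 100 genes: {pooled_cv:.3f}")
```

With four replicates the standard deviation of the variance estimate is roughly 80% of the variance itself, which is why borrowing information across genes with similar variability is attractive.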
We propose a Bayesian mixture model that classifies genes according to similarity in their variance. Genes in the same latent class then share a similar variance, which is estimated from more replicates than those available for any single gene, namely all replicates of all genes in that latent class. An example dataset of 9216 genes with four replicates per condition yielded four latent classes based on similarity of variance.
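The pooling idea, a shared variance estimated from all replicates of all genes in a class, can be sketched as follows. This is not the Bayesian mixture model itself: the sketch assumes the latent class labels are already known, whereas the model infers them from the data. The function name `pooled_class_variances` and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def pooled_class_variances(replicates, labels):
    """Pooled within-gene variance for each class.

    replicates: one list of replicate expression values per gene.
    labels: one class label per gene (assumed known in this sketch).
    Returns {label: sum of within-gene squared deviations / pooled df}.
    """
    if len(replicates) != len(labels):
        raise ValueError("replicates and labels must have the same length")
    sum_squares = defaultdict(float)
    dof = defaultdict(int)
    for values, label in zip(replicates, labels):
        if len(values) < 2:
            continue  # a single replicate carries no variance information
        gene_mean = fmean(values)
        sum_squares[label] += sum((v - gene_mean) ** 2 for v in values)
        dof[label] += len(values) - 1
    return {label: sum_squares[label] / dof[label] for label in dof}


# Toy data: three genes with four replicates each; the first two share a class.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 0.0, 1.0, 1.0],
]
class_labels = [0, 0, 1]
class_variances = pooled_class_variances(genes, class_labels)
print(class_variances)
```

Each gene contributes its own mean, so differences in expression level between genes in a class do not inflate the shared variance; only within-gene scatter is pooled.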
The mixture variance model provides a realistic and flexible estimate of the variance of gene expression data when replicates are limited. We believe that, by using the latent class variances, each estimated from the larger number of genes in its derived latent class, the p-values obtained are more robust than those based on either a constant variance across genes or gene-specific variance estimates.