Thygesen Helene H, Zwinderman Aeilko H
Clinical Epidemiology and Biostatistics, Academisch Medisch Centrum, University of Amsterdam, Meibergdreef 9, 1100 DD Amsterdam, The Netherlands.
BMC Bioinformatics. 2006 Mar 20;7:157. doi: 10.1186/1471-2105-7-157.
Serial Analysis of Gene Expression (SAGE) produces gene expression measurements on a discrete scale, due to the finite number of molecules in the sample. This means that part of the variance in SAGE data should be understood as the sampling error in a binomial or Poisson distribution, whereas other variance sources, in particular biological variance, should be modeled using a continuous distribution function, i.e. a prior on the intensity of the Poisson distribution. One challenge is that such a model predicts a large number of genes with zero counts, which cannot be observed.
We present a hierarchical Poisson model with a gamma prior and three different algorithms for estimating the parameters in the model. It turns out that the rate parameter in the gamma distribution can be estimated on the basis of a single SAGE library, whereas the estimate of the shape parameter becomes unstable. This means that the number of zero counts cannot be estimated reliably. When a bivariate model is applied to two SAGE libraries, however, the number of predicted zero counts becomes more stable and in approximate agreement with the number of transcripts observed across a large number of experiments. In all the libraries we analyzed there was a small population of very highly expressed tags, typically 1% of the tags, that could not be accounted for by the model. To handle those tags we chose to augment our model with a non-parametric component. We also show some results based on a log-normal distribution instead of the gamma distribution.
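To make the zero-count problem concrete, here is a minimal sketch of a gamma-Poisson hierarchy in Python. It is not one of the three estimation algorithms mentioned in the abstract, which are not described here. It assumes the standard result that a Poisson count with a gamma-distributed intensity has a negative binomial marginal distribution, and it conditions on a count of at least one because tags with zero counts are never observed. The function names and the parameter names `shape` and `rate` are illustrative choices, not taken from the paper.

```python
import math


def nb_pmf(k, shape, rate):
    """Marginal P(X = k) for X | lam ~ Poisson(lam), lam ~ Gamma(shape, rate).

    Integrating out lam gives a negative binomial distribution with
    success probability p = rate / (rate + 1).
    """
    p = rate / (rate + 1.0)
    log_pmf = (
        math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
        + shape * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def prob_zero(shape, rate):
    """Predicted fraction of tags with a zero count (never observed)."""
    return (rate / (rate + 1.0)) ** shape


def truncated_loglik(counts, shape, rate):
    """Log-likelihood of observed counts (all >= 1), conditioned on X > 0."""
    log_norm = math.log1p(-prob_zero(shape, rate))
    return sum(math.log(nb_pmf(k, shape, rate)) - log_norm for k in counts)
```

In this sketch the predicted zero fraction depends on the shape parameter through `prob_zero`, so an unstable shape estimate translates directly into an unstable estimate of the number of unobserved tags. A fit would maximize `truncated_loglik` over `shape` and `rate` with a numerical optimizer of your choice.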
By modeling SAGE data with a hierarchical Poisson model it is possible to separate the sampling variance from the variance in gene expression. If expression levels are reported at the gene level rather than at the tag level, genes mapped to multiple tags must be kept separate, since their expression levels show different statistical behavior. A log-normal prior provided a better fit to our data than the gamma prior, but except for a small subpopulation of tags with very high counts, the two priors are similar.
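The separation of variance sources can be illustrated with the law of total variance. The notation below is assumed for illustration and is not taken from the paper: $X$ is a tag count, $\lambda$ its Poisson intensity, and $\alpha$ and $\beta$ are the shape and rate of the gamma prior.

```latex
\begin{align*}
X \mid \lambda &\sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}(\alpha, \beta) \\
\operatorname{Var}(X)
  &= \mathbb{E}\!\left[\operatorname{Var}(X \mid \lambda)\right]
   + \operatorname{Var}\!\left(\mathbb{E}[X \mid \lambda]\right) \\
  &= \mathbb{E}[\lambda] + \operatorname{Var}(\lambda)
   = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^{2}}
\end{align*}
```

Under these assumptions the first term, $\alpha/\beta$, is the Poisson sampling variance, and the second term, $\alpha/\beta^{2}$, is the variance of the expression level itself.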