Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN 55905, USA.
BMC Genomics. 2012 Jul 7;13:304. doi: 10.1186/1471-2164-13-304.
mRNA expression data from next-generation sequencing platforms are obtained as counts per gene or exon. Counts have classically been assumed to follow a Poisson distribution, in which the variance equals the mean. The Negative Binomial distribution, which allows for over-dispersion (variance greater than the mean), is also commonly used to model count data.
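For reference, the two variance assumptions described above can be written out explicitly. The notation here is an assumption for illustration and is not taken from the paper: $Y$ is the count for a gene, $\mu$ its mean, and $\alpha \ge 0$ the dispersion in one common parameterization of the Negative Binomial.

```latex
\begin{align*}
\text{Poisson:} \quad & \operatorname{Var}(Y) = \mu \\
\text{Negative Binomial:} \quad & \operatorname{Var}(Y) = \mu + \alpha \mu^{2}
\end{align*}
```

With $\alpha = 0$ the Negative Binomial variance reduces to the Poisson case, and for $\alpha > 0$ the variance grows quadratically in the mean.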
In mRNA-Seq data from 25 subjects, we found that technical variation generally followed a Poisson distribution, as has been reported previously, whereas biological variability was over-dispersed relative to the Poisson model. The mean-variance relationship across all genes was quadratic, in keeping with a Negative Binomial (NB) distribution. Over-dispersed Poisson and NB distributional assumptions showed marked improvements in goodness-of-fit (GOF) over the standard Poisson model assumptions, but with evidence of over-fitting in some genes. Modeling of experimental effects improved GOF for high-variance genes but increased the over-fitting problem.
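The quadratic mean-variance relationship can be illustrated on simulated counts. The sketch below is not the authors' analysis: it assumes NumPy, draws NB counts as a gamma-Poisson mixture, and uses a simple method-of-moments dispersion estimate. The function names, gene means, dispersion value (0.2), and sample count are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_nb_counts(means, dispersion, n_samples, rng):
    """Draw NB counts as a gamma-Poisson mixture (illustrative helper).

    Each gene's Poisson rate is gamma-distributed with mean mu and
    variance dispersion * mu**2, so the counts have
    Var = mu + dispersion * mu**2.
    """
    means = np.asarray(means, dtype=float)
    shape = 1.0 / dispersion
    scale = means[:, None] * dispersion  # one row per gene
    rates = rng.gamma(shape, scale, size=(means.size, n_samples))
    return rng.poisson(rates)


def moment_dispersion(counts):
    """Per-gene mean, variance, and moment estimate (var - mean) / mean**2."""
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    return mean, var, (var - mean) / mean**2


rng = np.random.default_rng(0)
gene_means = np.array([10.0, 100.0, 1000.0])  # hypothetical expression levels

# Over-dispersed counts (standing in for biological variability).
nb_counts = simulate_nb_counts(gene_means, dispersion=0.2, n_samples=5000, rng=rng)
nb_mean, nb_var, nb_phi = moment_dispersion(nb_counts)

# Pure Poisson counts (standing in for technical variability).
pois_counts = rng.poisson(gene_means[:, None], size=(gene_means.size, 5000))
pois_mean, pois_var, pois_phi = moment_dispersion(pois_counts)

print("NB variance/mean ratio:     ", nb_var / nb_mean)
print("NB dispersion estimate:     ", nb_phi)
print("Poisson variance/mean ratio:", pois_var / pois_mean)
```

Under these assumptions the Poisson variance-to-mean ratio stays near 1 at every expression level, while the NB ratio grows with the mean, which is the signature of a quadratic mean-variance relationship.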
These conclusions will guide the development of analytical strategies for accurately modeling the variance structure in these data and for determining sample size, which in turn will aid in identifying true biological signals that inform our understanding of biological systems.