Modeling Sage data with a truncated gamma-Poisson model.

Author information

Thygesen Helene H, Zwinderman Aeilko H

Affiliation

Clinical Epidemiology and Biostatistics, Academisch Medisch Centrum, University of Amsterdam, Meibergdreef 9, 1100 DD Amsterdam, The Netherlands.

Publication information

BMC Bioinformatics. 2006 Mar 20;7:157. doi: 10.1186/1471-2105-7-157.

DOI:10.1186/1471-2105-7-157

PMID:16549008

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1479844/

Abstract

BACKGROUND

Serial Analysis of Gene Expression (SAGE) produces gene expression measurements on a discrete scale, because the sample contains a finite number of molecules. This means that part of the variance in SAGE data should be understood as the sampling error in a binomial or Poisson distribution, whereas other variance sources, in particular biological variance, should be modeled using a continuous distribution function, i.e., a prior on the intensity of the Poisson distribution. One challenge is that such a model predicts a large number of genes with zero counts, which cannot be observed.
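
To make the zero-count problem concrete, the setup can be written out as follows. This is a sketch in generic notation, not the paper's own: the symbols X (tag count), λ (Poisson intensity) and π (prior density on λ) are assumptions introduced here for illustration.

```latex
% Marginal distribution of a tag count under a Poisson model with a
% continuous prior \pi on the intensity \lambda (notation assumed):
P(X = k) \;=\; \int_0^\infty \frac{\lambda^k e^{-\lambda}}{k!}\,\pi(\lambda)\,d\lambda,
\qquad k = 0, 1, 2, \dots

% Tags with k = 0 never appear in a library, so the observed counts
% follow the zero-truncated distribution:
P(X = k \mid X \ge 1) \;=\; \frac{P(X = k)}{1 - P(X = 0)},
\qquad k = 1, 2, \dots
```

Fitting the model therefore has to rely on the truncated distribution, and the number of unobserved tags depends on the estimated value of P(X = 0).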

RESULTS

We present a hierarchical Poisson model with a gamma prior and three different algorithms for estimating the parameters in the model. It turns out that the rate parameter in the gamma distribution can be estimated on the basis of a single SAGE library, whereas the estimate of the shape parameter becomes unstable. This means that the number of zero counts cannot be estimated reliably. When a bivariate model is applied to two SAGE libraries, however, the number of predicted zero counts becomes more stable and in approximate agreement with the number of transcripts observed across a large number of experiments. In all the libraries we analyzed there was a small population of very highly expressed tags, typically 1% of the tags, that could not be accounted for by the model. To handle those tags we chose to augment our model with a non-parametric component. We also show some results based on a log-normal distribution instead of the gamma distribution.
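
The link between the shape parameter and the predicted number of zero counts can be illustrated with a short sketch. It assumes a gamma prior with `shape` and `rate` parameters, so that the marginal count distribution is negative binomial; the function names, the parametrization and the parameter values are illustrative assumptions, and this is not one of the three estimation algorithms from the paper.

```python
import math


def gamma_poisson_logpmf(k, shape, rate):
    """Log P(X = k) for X | lam ~ Poisson(lam), lam ~ Gamma(shape, rate).

    The marginal is negative binomial with p = rate / (1 + rate).
    """
    p = rate / (1.0 + rate)
    return (math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
            + shape * math.log(p) + k * math.log1p(-p))


def truncated_pmf(k, shape, rate):
    """P(X = k | X >= 1): distribution of counts among observed tags."""
    if k < 1:
        return 0.0
    p0 = math.exp(gamma_poisson_logpmf(0, shape, rate))
    return math.exp(gamma_poisson_logpmf(k, shape, rate)) / (1.0 - p0)


def truncated_loglik(counts, shape, rate):
    """Log-likelihood of observed (nonzero) counts under the truncated model."""
    p0 = math.exp(gamma_poisson_logpmf(0, shape, rate))
    log_norm = math.log1p(-p0)
    return sum(gamma_poisson_logpmf(k, shape, rate) - log_norm for k in counts)


def expected_unseen(n_observed, shape, rate):
    """Expected number of zero-count tags implied by n_observed seen tags."""
    p0 = math.exp(gamma_poisson_logpmf(0, shape, rate))
    return n_observed * p0 / (1.0 - p0)


if __name__ == "__main__":
    # Hypothetical values: hold the rate fixed and vary the shape to see
    # how strongly the implied number of unseen tags depends on it.
    for shape in (0.05, 0.1, 0.5, 1.0):
        print(shape, expected_unseen(10_000, shape, rate=1.0))
```

Because P(X = 0) approaches 1 as the shape parameter approaches 0, small changes in a small shape estimate translate into large changes in the implied number of unseen tags, which is consistent with the instability described above.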

CONCLUSION

By modeling SAGE data with a hierarchical Poisson model it is possible to separate the sampling variance from the variance in gene expression. If expression levels are reported at the gene level rather than at the tag level, genes mapped to multiple tags must be kept separate, since their expression levels show a different statistical behavior. A log-normal prior provided a better fit to our data than the gamma prior, but except for a small subpopulation of tags with very high counts, the two priors are similar.
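
The separation of variance sources can be sketched with the law of total variance. The gamma parametrization below (shape α, rate β) is an assumption made for illustration; the abstract does not state which parametrization the paper uses.

```latex
% X | \lambda \sim \mathrm{Poisson}(\lambda), \quad \lambda \sim \mathrm{Gamma}(\alpha, \beta)
% (shape \alpha, rate \beta; notation assumed)
\operatorname{Var}(X)
  \;=\; \underbrace{\mathbb{E}\!\left[\operatorname{Var}(X \mid \lambda)\right]}_{\text{sampling variance}}
  \;+\; \underbrace{\operatorname{Var}\!\left(\mathbb{E}[X \mid \lambda]\right)}_{\text{expression variance}}
  \;=\; \mathbb{E}[\lambda] + \operatorname{Var}(\lambda)
  \;=\; \frac{\alpha}{\beta} + \frac{\alpha}{\beta^{2}}
```

The first term is the Poisson sampling contribution and the second is the variance of the underlying expression intensity.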
