Baggerly Keith A, Deng Li, Morris Jeffrey S, Aldaz C Marcelo
Department of Biostatistics, UT M.D. Anderson Cancer Center, 1515 Holcombe Blvd, Box 447, Houston, TX 77030-4009, USA.
Bioinformatics. 2003 Aug 12;19(12):1477-83. doi: 10.1093/bioinformatics/btg173.
In contrasting levels of gene expression between groups of SAGE libraries, the libraries within each group are often combined and the counts for the tag of interest summed, and inference is made on the basis of these larger 'pseudolibraries'. While this captures the sampling variability inherent in the procedure, it fails to allow for normal variation in levels of the gene between individuals within the same group, and can consequently overstate the significance of the results. The effect is not slight: between-library variation can be hundreds of times the within-library variation.
We introduce a beta-binomial sampling model that correctly incorporates both sources of variation. We show how to fit the parameters of this model, and introduce a test statistic for differential expression similar to a two-sample t-test.
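The contrast between pooling libraries and allowing for between-library variation can be illustrated with a minimal sketch. This is not the authors' fitting procedure: it uses equal weights for every library and a simple moment-based variance, floored at the pooled binomial variance, whereas the paper fits the beta-binomial parameters. The function names and the tag counts below are hypothetical, chosen only for illustration.

```python
import math


def group_estimate(counts, sizes):
    """Return (proportion estimate, variance of that estimate) for one group.

    Simplifying assumption: every library gets equal weight, and the
    between-library variance is the sample variance of the per-library
    proportions. The binomial variance of the pooled library is used as
    a floor, since sampling variability is always present.
    """
    m = len(counts)
    props = [x / n for x, n in zip(counts, sizes)]
    p_bar = sum(props) / m
    sample_var = sum((p - p_bar) ** 2 for p in props) / (m - 1)
    var_between = sample_var / m  # variance of the mean proportion
    p_pool = sum(counts) / sum(sizes)
    var_binomial = p_pool * (1 - p_pool) / sum(sizes)
    return p_bar, max(var_between, var_binomial)


def t_like_statistic(group_a, group_b):
    """Two-sample statistic that allows for between-library variation."""
    p_a, v_a = group_estimate(*group_a)
    p_b, v_b = group_estimate(*group_b)
    return (p_a - p_b) / math.sqrt(v_a + v_b)


def pooled_z_statistic(group_a, group_b):
    """Two-proportion z statistic computed on summed 'pseudolibraries'."""
    x_a, n_a = sum(group_a[0]), sum(group_a[1])
    x_b, n_b = sum(group_b[0]), sum(group_b[1])
    p = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (x_a / n_a - x_b / n_b) / se


# Hypothetical tag counts and library sizes (not data from the paper).
group_a = ([10, 40, 25], [50000, 50000, 50000])
group_b = ([60, 20, 70], [50000, 50000, 50000])

z = pooled_z_statistic(group_a, group_b)
t = t_like_statistic(group_a, group_b)
print(f"pooled z = {z:.2f}, t-like = {t:.2f}")
```

With these made-up counts the pooled statistic has magnitude near 5, while the statistic that accounts for library-to-library spread has magnitude near 1.4, which shows how pooling can overstate significance.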