Baggerly Keith A, Deng Li, Morris Jeffrey S, Aldaz C Marcelo
Department of Biostatistics and Applied Mathematics, UT M. D. Anderson Cancer Center, Houston, TX, USA.
BMC Bioinformatics. 2004 Oct 6;5:144. doi: 10.1186/1471-2105-5-144.
Two major identifiable sources of variation in data derived from the Serial Analysis of Gene Expression (SAGE) are within-library sampling variability and between-library heterogeneity within a group. Most published methods for identifying differential expression focus on just the sampling variability. In recent work, the problem of assessing differential expression between two groups of SAGE libraries has been addressed by introducing a beta-binomial hierarchical model that explicitly deals with both of the above sources of variation. This model leads to a test statistic analogous to a weighted two-sample t-test. When the number of groups involved is more than two, however, a more general approach is needed.
We describe how logistic regression with overdispersion supplies this generalization and, as a byproduct, provides a framework for incorporating other covariates into the model. A practical advantage of this approach is that logistic regression routines are available in several common statistical packages.
The described method provides an easily implemented tool for analyzing SAGE data that correctly handles multiple types of variation and allows for more flexible modelling.
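As a rough illustration of logistic regression with overdispersion applied to more than two groups, the sketch below fits a binomial logistic model to the counts of a single tag across seven libraries in three groups, then inflates the standard errors by a dispersion factor. Several things here are assumptions and not taken from the paper: the counts and library sizes are invented, the function names are hypothetical, and the dispersion is estimated with the generic Pearson chi-square (quasi-likelihood) approach, which may differ from the authors' estimator.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    k = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def fitted(X, beta):
    """Fitted tag proportions under the logit link."""
    return [1 / (1 + math.exp(-sum(x * b for x, b in zip(row, beta))))
            for row in X]

def information(X, n, mu):
    """Fisher information X^T W X with binomial weights n * mu * (1 - mu)."""
    k = len(X[0])
    w = [ni * m * (1 - m) for ni, m in zip(n, mu)]
    return [[sum(wi * row[i] * row[j] for row, wi in zip(X, w))
             for j in range(k)] for i in range(k)]

def fit_overdispersed_logistic(X, y, n, max_iter=100, tol=1e-10):
    """Newton/IRLS logistic fit, then Pearson-based dispersion scaling.

    Returns (beta, phi, se). The Pearson estimate of phi is an assumed,
    generic choice, not necessarily the estimator used in the paper.
    """
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        mu = fitted(X, beta)
        score = [sum(row[i] * (yi - ni * m)
                     for row, yi, ni, m in zip(X, y, n, mu))
                 for i in range(k)]
        step = solve(information(X, n, mu), score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    mu = fitted(X, beta)
    pearson = sum((yi - ni * m) ** 2 / (ni * m * (1 - m))
                  for yi, ni, m in zip(y, n, mu))
    # Dispersion below 1 is truncated so sampling variability is the floor.
    phi = max(1.0, pearson / (len(y) - k))
    info = information(X, n, mu)
    se = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        se.append(math.sqrt(phi * solve(info, unit)[i]))
    return beta, phi, se

# Invented example: one tag, seven libraries, three groups (A, B, C).
y = [12, 30, 55, 90, 40, 15, 22]                          # tag counts
n = [50000, 60000, 50000, 55000, 45000, 40000, 50000]     # library sizes
# Columns: intercept (group A baseline), indicator for B, indicator for C.
X = [[1, 0, 0], [1, 0, 0],
     [1, 1, 0], [1, 1, 0], [1, 1, 0],
     [1, 0, 1], [1, 0, 1]]

beta, phi, se = fit_overdispersed_logistic(X, y, n)
for name, b, s in zip(["intercept", "group B", "group C"], beta, se):
    print(f"{name}: log-odds {b:.3f}, scaled SE {s:.3f}, z {b / s:.2f}")
print(f"estimated dispersion: {phi:.2f}")
```

Additional covariates would enter as extra columns of `X`. In practice the same kind of fit is available from standard routines (for example a quasi-binomial GLM) instead of hand-written Newton steps.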