Korbinian Strimmer
Department of Statistics, University of Munich, Ludwigstrasse 33, D-80539 Munich, Germany.
BMC Bioinformatics. 2003 Mar 20;4:10. doi: 10.1186/1471-2105-4-10.
Using suitable error models for gene expression measurements is essential in the statistical analysis of microarray data. However, the true probabilistic model underlying gene expression intensity readings is generally not known. Current approaches therefore either assume a simple parametric model (usually a transformed normal distribution) or estimate the empirical distribution. Neither strategy may be optimal for gene expression data: the non-parametric approach ignores known structural information, whereas fully parametric models run the risk of misspecification. A further related problem is the choice of a suitable scale for the model (e.g. observed vs. log-scale).
Here a simple semi-parametric model for gene expression measurement error is presented. In this approach inference is based on an approximate likelihood function (the extended quasi-likelihood). Only partial knowledge about the unknown true distribution is required to construct this function. In the case of gene expression data, this information is available in the form of the postulated (e.g. quadratic) variance structure of the data. As the quasi-likelihood behaves (almost) like a proper likelihood, it allows for the estimation of calibration and variance parameters, and it is also straightforward to obtain corresponding approximate confidence intervals. Unlike most other frameworks, it also allows analysis on any preferred scale, i.e. on the original linear scale as well as on a transformed scale. It can also be employed in regression approaches to model systematic (e.g. array or dye) effects.
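To make the construction concrete, the following is a minimal sketch of an extended quasi-likelihood for a single intensity reading. It uses the general form Q+(y; mu) = -1/2 log(2 pi V(y)) - 1/2 D(y; mu), where D(y; mu) = 2 * integral from mu to y of (y - t) / V(t) dt is the quasi-deviance. The specific variance function V(mu) = a + b * mu^2, the absorption of the dispersion parameter into a and b, and all function names are assumptions made for illustration only; the abstract states only that the variance structure is postulated to be, for example, quadratic.

```python
import math


def variance(mu, a, b):
    """Assumed quadratic variance function V(mu) = a + b * mu**2 (illustrative)."""
    return a + b * mu * mu


def deviance(y, mu, a, b):
    """Quasi-deviance D(y, mu) = 2 * integral from mu to y of (y - t) / V(t) dt.

    Closed form for V(t) = a + b * t**2; requires a > 0 and b > 0.
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    r = math.sqrt(b / a)
    # Integral of y / (a + b t^2) from mu to y.
    atan_part = y / math.sqrt(a * b) * (math.atan(r * y) - math.atan(r * mu))
    # Integral of t / (a + b t^2) from mu to y.
    log_part = (math.log(variance(y, a, b)) - math.log(variance(mu, a, b))) / (2.0 * b)
    return 2.0 * (atan_part - log_part)


def eql(y, mu, a, b):
    """Extended quasi-log-likelihood of one observation y with mean mu."""
    return -0.5 * math.log(2.0 * math.pi * variance(y, a, b)) - 0.5 * deviance(y, mu, a, b)


def total_eql(ys, mu, a, b):
    """Sum of extended quasi-log-likelihoods over replicate readings sharing one mean."""
    return sum(eql(y, mu, a, b) for y in ys)


if __name__ == "__main__":
    # Hypothetical replicate intensities and variance parameters (not real data).
    readings = [90.0, 100.0, 110.0, 130.0]
    a, b = 50.0, 0.01
    # Crude grid search for the mean that maximizes the extended quasi-likelihood.
    grid = [50.0 + 0.5 * i for i in range(301)]
    best_mu = max(grid, key=lambda m: total_eql(readings, m, a, b))
    print("grid maximizer of mu:", best_mu)
```

Under these assumptions the quasi-score for a shared mean is the sum of (y_i - mu) / V(mu), so the maximizing mu coincides with the sample mean of the replicates. In a fuller analysis the variance parameters a and b would be estimated by maximizing the same function rather than being fixed in advance, and as b approaches zero the expression reduces to the ordinary normal log-likelihood with variance a.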
The quasi-likelihood framework provides a simple and versatile approach to analyze gene expression data that does not make any strong distributional assumptions about the underlying error model. For several simulated as well as real data sets it provides a better fit to the data than competing models. In an example it also improved the power of tests to identify differential expression.