Gusnanto Arief, Ploner Alexander, Pawitan Yudi
Medical Research Council-Biostatistics Unit, Institute of Public Health, Cambridge CB2 2SR, United Kingdom.
Stat Appl Genet Mol Biol. 2005;4:Article26. doi: 10.2202/1544-6115.1145. Epub 2005 Sep 21.
Microarray experiments produce expression measurements for thousands of genes simultaneously, but usually for only a small number of RNA samples. The most common problem is the identification of genes that are differentially expressed between different groups of samples or biological conditions. As the number of genes far exceeds the number of RNA samples, the inherent multiplicity poses a severe problem in both hypothesis testing and effect estimation. While much of the recent literature focuses on hypothesis testing, we concentrate in this paper on effect estimation as a tool for the identification of differentially expressed genes. We propose a linear mixed model in which the random effects are assumed to follow a mixture distribution, and study in detail the case of a mixture of three normals, corresponding to genes that are down-, up- or non-regulated. Our approach leads to a new type of non-linear shrinkage estimation, in which a proportion of the estimates is shrunk to zero, while the rest follow standard linear shrinkage. This allows us to estimate the log fold-change of the genes involved and to identify those that are differentially expressed within the same model framework. We investigate the operating characteristics of our method using simulation and spike-in studies, and illustrate its application to real data using a breast-cancer dataset.
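The non-linear shrinkage described above can be illustrated with a minimal sketch. It is not the authors' estimation procedure: it assumes the mixture parameters are already known, that each gene has a single observed log fold-change with known noise variance, and that the non-regulated component is a narrow normal centred at zero. The function name `mixture_shrinkage` and all numeric values are hypothetical and chosen only for illustration. Under these assumptions, the posterior mean is a weighted average of the per-component linear shrinkage estimates, with weights given by the posterior component probabilities.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def mixture_shrinkage(d, weights, means, variances, noise_var):
    """Posterior mean of the true effect under a normal-mixture prior.

    Assumed model (illustrative, not taken from the paper):
        d | b ~ N(b, noise_var),   b ~ sum_k weights[k] * N(means[k], variances[k])
    Returns the shrunken estimates and the posterior component probabilities.
    """
    d = np.asarray(d, dtype=float)[:, None]             # shape (genes, 1)
    w = np.asarray(weights, dtype=float)[None, :]       # shape (1, components)
    m = np.asarray(means, dtype=float)[None, :]
    v = np.asarray(variances, dtype=float)[None, :]

    # Marginal density of d under each component: N(mean_k, var_k + noise_var).
    marginal = w * normal_pdf(d, m, v + noise_var)
    post_prob = marginal / marginal.sum(axis=1, keepdims=True)

    # Within each component, the posterior mean is ordinary linear shrinkage.
    comp_mean = (v * d + noise_var * m) / (v + noise_var)

    estimate = (post_prob * comp_mean).sum(axis=1)
    return estimate, post_prob


# Hypothetical parameters: components are (down-, non-, up-regulated).
observed = [-3.0, -0.2, 0.0, 0.3, 3.0]
estimate, post_prob = mixture_shrinkage(
    observed,
    weights=[0.05, 0.90, 0.05],
    means=[-2.0, 0.0, 2.0],
    variances=[0.5, 0.01, 0.5],
    noise_var=0.25,
)

for d_obs, est, p in zip(observed, estimate, post_prob):
    print(f"observed={d_obs:+.2f}  shrunk={est:+.3f}  P(non-regulated)={p[1]:.3f}")
```

With these illustrative values, small observed log fold-changes are pulled almost entirely to zero because the non-regulated component dominates their posterior, whereas large ones are shrunk only linearly toward the mean of the up- or down-regulated component. The posterior probability of the non-regulated component can also serve, in this sketch, as a simple indicator of differential expression.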