Scheel Ida, Aldrin Magne, Glad Ingrid K, Sørum Ragnhild, Lyng Heidi, Frigessi Arnoldo
Department of Mathematics, University of Oslo, PO Box 1053, Blindern, NO-0316 Oslo, Norway.
Bioinformatics. 2005 Dec 1;21(23):4272-9. doi: 10.1093/bioinformatics/bti708. Epub 2005 Oct 10.
Missing values are problematic for the analysis of microarray data. Imputation methods have been compared in terms of the similarity between imputed and true values in simulation experiments, not in terms of their influence on the final analysis. The focus has been on values missing at random, although entries are also missing not at random.
We investigate the influence of imputation on the detection of differentially expressed genes from cDNA microarray data. We apply ANOVA for microarrays and SAM, and look at the differentially expressed genes that are lost because of imputation. We show that this new measure provides useful information that the traditional root mean squared error cannot capture. We also show that the type of missingness matters: imputing 5% of values missing not at random has the same effect as imputing 10-30% of values missing at random. We propose a new method for imputation (LinImp), which fits a simple linear model for each channel separately, and compare it with the widely used KNNimpute method. For 10% missing at random, KNNimpute leads to twice as many lost differentially expressed genes as LinImp.
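To make the two evaluation measures concrete, the following is a minimal Python sketch, not the authors' implementation. It contrasts the root mean squared error of imputed values with the number of differentially expressed genes lost after imputation. Everything in it is an assumption made for illustration: the simulated log-ratios, the row-mean imputation (a stand-in for neither LinImp nor KNNimpute), the Welch t-statistic with a fixed cutoff (a stand-in for SAM and ANOVA), and all function names.

```python
import math
import random
import statistics


def simulate(n_genes=200, n_de=20, n_a=4, n_b=4, shift=2.0, seed=1):
    """Hypothetical log-ratio matrix; the first n_de genes differ between groups."""
    rng = random.Random(seed)
    data = []
    for g in range(n_genes):
        effect = shift if g < n_de else 0.0
        row = [rng.gauss(0.0, 0.5) for _ in range(n_a)]
        row += [rng.gauss(effect, 0.5) for _ in range(n_b)]
        data.append(row)
    return data


def mask_at_random(data, frac, seed=2):
    """Mark each entry as missing independently with probability frac."""
    rng = random.Random(seed)
    return [[rng.random() < frac for _ in row] for row in data]


def row_mean_impute(data, mask):
    """Illustrative imputation only: fill each masked entry with the gene's observed mean."""
    out = []
    for row, mrow in zip(data, mask):
        observed = [x for x, m in zip(row, mrow) if not m]
        fill = statistics.mean(observed) if observed else 0.0
        out.append([fill if m else x for x, m in zip(row, mrow)])
    return out


def rmse(truth, imputed, mask):
    """Root mean squared error over the masked entries only."""
    errs = [(truth[g][j] - imputed[g][j]) ** 2
            for g, mrow in enumerate(mask) for j, m in enumerate(mrow) if m]
    return math.sqrt(sum(errs) / len(errs)) if errs else 0.0


def t_stat(a, b):
    """Welch t-statistic between two groups of values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def de_genes(data, n_a, cutoff):
    """Indices of genes with |t| above the cutoff (stand-in for SAM / ANOVA)."""
    return {g for g, row in enumerate(data)
            if abs(t_stat(row[:n_a], row[n_a:])) > cutoff}


truth = simulate()
mask = mask_at_random(truth, 0.10)       # 10% missing at random
imputed = row_mean_impute(truth, mask)

de_complete = de_genes(truth, n_a=4, cutoff=4.0)
de_imputed = de_genes(imputed, n_a=4, cutoff=4.0)
lost = de_complete - de_imputed          # found on complete data, missed after imputation

print(f"RMSE on imputed entries: {rmse(truth, imputed, mask):.3f}")
print(f"Lost differentially expressed genes: {len(lost)} of {len(de_complete)}")
```

The RMSE summarizes how close imputed values are to the true ones, whereas the lost-gene count measures the effect on the downstream analysis, so two imputation methods with similar RMSE may still differ in how many genes they cause to be missed.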
The R package for LinImp is available at http://folk.uio.no/idasch/imp.