Department of Biostatistics and Bioinformatics, 1518 Clifton Rd., N.E., 3rd Floor, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2011 May-Jun;8(3):723-31. doi: 10.1109/TCBB.2010.73.
Microarray gene expression data often contain missing values. Accurate estimation of the missing values is important for downstream data analyses that require complete data. Nonlinear relationships between gene expression levels have not been well utilized in missing value imputation. We propose an imputation scheme based on nonlinear dependencies between genes. Using simulations based on real microarray data, we show that incorporating nonlinear relationships could improve the accuracy of missing value imputation, in terms of both normalized root-mean-squared error and preservation of the list of significant genes in statistical testing. In addition, we studied the impact of artificial dependencies introduced by data normalization on the simulation results. Our results suggest that methods relying on global correlation structures may yield overly optimistic simulation results when the data have been subjected to row (gene)-wise mean removal.
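The simulation-based evaluation described above can be sketched as follows: hide a fraction of entries in a complete gene-by-sample matrix, impute them, and score the result with normalized root-mean-squared error (NRMSE). This is a minimal illustration, not the authors' procedure. The abstract does not state the exact normalization, so the sketch assumes one common definition (RMSE over the hidden entries divided by the standard deviation of their true values). The function names `nrmse`, `mask_entries`, and `row_mean_impute` are hypothetical, the data are synthetic, and row-mean imputation is only a simple baseline standing in for the proposed nonlinear scheme.

```python
import math
import random
from statistics import mean, pvariance


def nrmse(true_values, imputed_values):
    """RMSE of the imputed entries divided by the std. dev. of the true entries.

    Assumed definition; the abstract does not specify the normalization.
    """
    if len(true_values) != len(imputed_values) or len(true_values) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mse = mean((t - i) ** 2 for t, i in zip(true_values, imputed_values))
    return math.sqrt(mse / pvariance(true_values))


def mask_entries(matrix, fraction, rng):
    """Pick a random set of (row, column) cells to treat as missing."""
    cells = [(r, c) for r in range(len(matrix)) for c in range(len(matrix[0]))]
    k = max(1, round(fraction * len(cells)))
    return rng.sample(cells, k)


def row_mean_impute(matrix, masked):
    """Baseline imputation: fill each hidden cell with the mean of the
    remaining observed values in the same row (gene)."""
    hidden = set(masked)
    imputed = []
    for r, c in masked:
        observed = [v for j, v in enumerate(matrix[r]) if (r, j) not in hidden]
        imputed.append(mean(observed) if observed else 0.0)
    return imputed


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 20 genes x 6 samples matrix, standing in for real microarray data.
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(20)]
    masked = mask_entries(matrix, 0.05, rng)
    truth = [matrix[r][c] for r, c in masked]
    print("NRMSE of row-mean baseline:", nrmse(truth, row_mean_impute(matrix, masked)))
```

Under this definition, an NRMSE of 0 means the hidden values were recovered exactly, and imputing every hidden entry with the mean of the true hidden values gives an NRMSE of 1, so values well below 1 indicate that the imputation captures real structure in the data.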