Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA.
Bioinformatics. 2011 Jan 1;27(1):78-86. doi: 10.1093/bioinformatics/btq613. Epub 2010 Nov 2.
Microarray experiments frequently produce multiple missing values (MVs) because of flaws such as dust, scratches, insufficient resolution or hybridization errors on the chips. Unfortunately, many downstream algorithms require a complete data matrix. This work aims to determine the impact of MV imputation on downstream analysis, and whether ranking imputation methods by imputation accuracy correlates well with the biological impact of the imputation.
Using eight datasets for differential expression (DE) and classification analysis and eight datasets for gene clustering, we demonstrate the biological impact of missing-value imputation on statistical downstream analyses, including three commonly employed DE methods, four classifiers and three gene-clustering methods. We correlated the rankings of imputation methods based on three root-mean-squared error (RMSE) measures with the rankings based on the downstream analysis methods. This comparison was used to investigate which RMSE measure was most consistent with the biological impact measures, and which downstream analysis methods were most sensitive to the choice of imputation procedure.
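The ranking comparison described above can be illustrated with a minimal sketch. It assumes Spearman rank correlation as the agreement measure (the abstract does not name the correlation statistic), and the function names `rmse`, `ranks` and `spearman`, as well as all scores, are hypothetical values chosen for illustration, not results from the study.

```python
import math

def rmse(true_vals, imputed_vals):
    """Root-mean-squared error between true and imputed entries."""
    n = len(true_vals)
    return math.sqrt(sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n)

def ranks(scores):
    """Rank scores in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = avg
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical scores for four imputation methods (lower is better in both lists).
rmse_scores = [0.40, 0.55, 0.70, 0.90]      # imputation-accuracy measure
downstream_loss = [0.10, 0.20, 0.15, 0.30]  # stand-in for a biological impact measure

agreement = spearman(rmse_scores, downstream_loss)
print(f"Rank agreement between RMSE and downstream impact: {agreement:.2f}")
```

A value near 1 would suggest that the accuracy measure orders the imputation methods much as the downstream analysis does, while a value near 0 would suggest it is a poor surrogate.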
DE was the most sensitive to the choice of imputation procedure, while classification was the least sensitive and clustering was intermediate between the two. The logged RMSE (LRMSE) measure had the highest correlation with the imputation rankings based on the DE results, indicating that the LRMSE is the best representative surrogate among the three RMSE-based measures. Bayesian principal component analysis and least squares adaptive appeared to be the best performing methods in the empirical downstream evaluation.