INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire, DSIMB, Université Paris Diderot-Paris 7, 2 place Jussieu, Paris, France.
BMC Genomics. 2010 Jan 7;11:15. doi: 10.1186/1471-2164-11-15.
Microarray technologies produce large amounts of data. In a previous study, we showed the value of the k-Nearest Neighbour approach for restoring missing gene expression values and its positive impact on gene clustering by hierarchical algorithms. Since then, numerous replacement methods have been proposed to impute missing values (MVs) in microarray data. In this study, we evaluated twelve different usable methods and their influence on the quality of gene clustering. We used several datasets, covering both kinetic and non-kinetic experiments from yeast and human.
We underline the excellent efficiency of the approaches proposed and implemented by Bo and co-workers, especially the one based on expectation maximization (EM_array). These improvements were also observed for the imputation of extreme values, which are the most difficult values to predict. We showed that imputed MVs still have important effects on the stability of gene clusters. The improvement in clustering obtained with hierarchical clustering remains limited and is not sufficient to completely restore the correct gene associations. However, a common tendency can be found between the quality of the imputation method and gene cluster stability. Although comparing clustering algorithms is a complex task, we observed that the k-means approach conserves gene associations more efficiently.
More than 6,000,000 independent simulations assessed the quality of 12 imputation methods on five very different biological datasets. Important improvements have thus been made since our last study. The EM_array approach is an efficient method for restoring missing gene expression values, with a lower estimation error level. Nonetheless, the presence of MVs, even at a low rate, is a major factor of gene cluster instability. Our study highlights the need for a systematic assessment of imputation methods and therefore for dedicated benchmarks. A noteworthy point is the specific influence of some biological datasets.
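The kind of simulation summarised above can be illustrated with a minimal sketch: hide known values at random, impute them, and measure the error on the hidden entries only. This is not the authors' protocol or code. The synthetic data, the 5% missing rate, the unweighted neighbour average, and the function names `mask_at_random`, `knn_impute` and `rmse_on_masked` are all assumptions made for illustration.

```python
import numpy as np


def mask_at_random(data, rate, rng):
    """Return a copy of `data` with a fraction `rate` of entries set to NaN."""
    masked = data.copy()
    n_missing = int(round(rate * data.size))
    idx = rng.choice(data.size, size=n_missing, replace=False)
    masked.flat[idx] = np.nan
    return masked


def knn_impute(data, k=3):
    """Fill NaNs in a genes x conditions matrix from the k nearest complete genes.

    Simplified illustration: neighbours are averaged without weighting.
    """
    filled = data.copy()
    complete = [j for j in range(data.shape[0]) if not np.isnan(data[j]).any()]
    for i in np.flatnonzero(np.isnan(data).any(axis=1)):
        missing = np.isnan(data[i])
        # Distance is computed only over the conditions observed for gene i.
        dists = [np.sqrt(np.mean((data[j, ~missing] - data[i, ~missing]) ** 2))
                 for j in complete]
        nearest = [complete[n] for n in np.argsort(dists)[:k]]
        filled[i, missing] = data[nearest][:, missing].mean(axis=0)
    return filled


def rmse_on_masked(truth, masked, imputed):
    """Root mean square error restricted to the artificially hidden entries."""
    holes = np.isnan(masked)
    return float(np.sqrt(np.mean((truth[holes] - imputed[holes]) ** 2)))


# Synthetic stand-in for an expression matrix: 200 genes, 10 conditions,
# drawn from 4 underlying profiles plus noise (an assumption, not real data).
rng = np.random.default_rng(0)
profiles = rng.normal(size=(4, 10))
truth = profiles[rng.integers(0, 4, size=200)] + 0.1 * rng.normal(size=(200, 10))

masked = mask_at_random(truth, rate=0.05, rng=rng)
imputed = knn_impute(masked, k=3)
error = rmse_on_masked(truth, masked, imputed)
print(f"RMSE on hidden entries: {error:.3f}")
```

Repeating this loop over many random masks, missing rates, methods and datasets gives the kind of benchmark the abstract calls for. Restricting the error to the hidden entries matters, because observed values are left unchanged and would otherwise dilute the estimate.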