Tuikkala Johannes, Elo Laura L, Nevalainen Olli S, Aittokallio Tero
Department of Information Technology and TUCS, University of Turku, FI-20014 Turku, Finland.
BMC Bioinformatics. 2008 Apr 18;9:202. doi: 10.1186/1471-2105-9-202.
Missing values frequently pose problems in gene expression microarray experiments because they can hinder downstream analysis of the datasets. Although several missing value imputation approaches are available to microarray users and new ones are constantly being developed, there is no general consensus on how to choose between them, since their performance seems to vary drastically depending on the dataset being used.
We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods.
The results demonstrate that, while missing values still severely complicate microarray data analysis, their impact on the discovery of biologically meaningful gene groups can, to a certain degree, be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
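The measurement-level part of the comparison described above can be illustrated with a small sketch. It is not the authors' code or data: the synthetic matrix, the masking rate, the helper names, and the normalized RMSE definition used here are all assumptions made for illustration, and a simple k-nearest-neighbour imputer stands in for the more advanced methods (BPCA is not implemented). The sketch hides a fraction of known values, fills them in with zeros, row averages, or neighbour averages, and scores each result against the hidden truth.

```python
import math
import random


def make_data(n_genes=60, n_arrays=12, n_groups=3, noise=0.2, seed=0):
    """Synthetic expression matrix (assumption): each gene follows one of a
    few group profiles plus Gaussian noise."""
    rng = random.Random(seed)
    profiles = [[rng.gauss(0, 1) for _ in range(n_arrays)] for _ in range(n_groups)]
    return [[p + rng.gauss(0, noise) for p in profiles[g % n_groups]]
            for g in range(n_genes)]


def mask(data, rate=0.1, seed=1):
    """Hide entries completely at random; None marks a missing value."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in data]


def row_mean(row):
    observed = [v for v in row if v is not None]
    return sum(observed) / len(observed) if observed else 0.0


def impute_zero(m):
    return [[0.0 if v is None else v for v in row] for row in m]


def impute_row_average(m):
    return [[row_mean(row) if v is None else v for v in row] for row in m]


def impute_knn(m, k=5):
    """Fill each gap with the mean of the k most similar genes that have
    that column observed; fall back to the row average if none do."""
    out = [row[:] for row in m]
    for i, row in enumerate(m):
        if None not in row:
            continue
        # Mean squared distance to every other gene over shared observed columns.
        dists = []
        for j, other in enumerate(m):
            if j == i:
                continue
            shared = [(a - b) ** 2 for a, b in zip(row, other)
                      if a is not None and b is not None]
            if shared:
                dists.append((sum(shared) / len(shared), j))
        dists.sort()
        for c, v in enumerate(row):
            if v is None:
                donors = [m[j][c] for _, j in dists if m[j][c] is not None][:k]
                out[i][c] = sum(donors) / len(donors) if donors else row_mean(row)
    return out


def nrmse(truth, masked, imputed):
    """RMSE over the hidden entries divided by the standard deviation of
    their true values (one common normalization, assumed here)."""
    true_vals, sq_errs = [], []
    for t_row, m_row, i_row in zip(truth, masked, imputed):
        for t, m, x in zip(t_row, m_row, i_row):
            if m is None:
                true_vals.append(t)
                sq_errs.append((x - t) ** 2)
    mean_t = sum(true_vals) / len(true_vals)
    var_t = sum((t - mean_t) ** 2 for t in true_vals) / len(true_vals)
    return math.sqrt(sum(sq_errs) / len(sq_errs) / var_t)


if __name__ == "__main__":
    truth = make_data()
    masked = mask(truth)
    for name, method in [("zero", impute_zero),
                         ("row average", impute_row_average),
                         ("kNN", impute_knn)]:
        print(f"{name:12s} NRMSE = {nrmse(truth, masked, method(masked)):.3f}")
```

A score of this kind captures only measurement-level accuracy. Evaluating how well the original gene clusters are reproduced would additionally require clustering the complete and the imputed matrices and comparing the resulting partitions, which this sketch does not attempt.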