Brock Guy N, Shaffer John R, Blakesley Richard E, Lotz Meredith J, Tseng George C
Department of Bioinformatics and Biostatistics, School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40292, USA.
BMC Bioinformatics. 2008 Jan 10;9:12. doi: 10.1186/1471-2105-9-12.
Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. Many methods have been proposed in the literature to estimate missing values using the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions under which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set.
We found that the top-performing imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior across all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty of mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost.
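The abstract does not state the formula behind the entropy measure. As a rough illustration only, one plausible way to score how hard a matrix is to map to a lower-dimensional subspace is the normalized Shannon entropy of its squared singular values: a value near 0 means a few components explain almost all variation, a value near 1 means variation is spread evenly. The function name `spectrum_entropy` and the row-centering step are assumptions made for this sketch, not the authors' definition.

```python
import numpy as np

def spectrum_entropy(X):
    """Normalized Shannon entropy of the squared singular values of X.

    Illustrative assumption: rows are genes, columns are samples, and X has
    no missing entries. This is a sketch, not the paper's exact measure.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)   # centre each gene (row)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    if len(s) < 2 or not np.any(s > 0):
        return 0.0
    p = s**2 / np.sum(s**2)                  # share of variation per component
    p = p[p > 0]                             # skip exact zeros (0 * log 0 = 0)
    return float(-np.sum(p * np.log(p)) / np.log(len(s)))
```

Under this sketch, a rank-one matrix such as `np.outer(a, b)` scores close to 0, while a matrix whose variation is spread over many components scores closer to 1.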
Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better on data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
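The simulation-based self-training idea can be sketched in a few lines: hide a random subset of the observed values, let each candidate method impute them, and prefer the method with the lowest error on the hidden entries. The sketch below is a minimal illustration under stated assumptions. The two candidate imputers (row mean and column mean) are simple stand-ins rather than any of the eight methods evaluated, and the names `self_training_select`, `frac` and `rounds` are hypothetical.

```python
import math
import random

def row_mean_impute(m):
    """Fill each missing entry (None) with the mean of its row's observed values."""
    out = []
    for row in m:
        seen = [v for v in row if v is not None]
        fill = sum(seen) / len(seen) if seen else 0.0
        out.append([fill if v is None else v for v in row])
    return out

def col_mean_impute(m):
    """Fill each missing entry (None) with the mean of its column's observed values."""
    transposed = [list(col) for col in zip(*m)]
    return [list(row) for row in zip(*row_mean_impute(transposed))]

def self_training_select(m, methods, frac=0.1, rounds=20, seed=0):
    """Return (best method name, mean RMSE per method) on artificially hidden entries."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(m)
                for j, v in enumerate(row) if v is not None]
    n_hide = max(1, int(frac * len(observed)))
    scores = {name: 0.0 for name in methods}
    for _ in range(rounds):
        hidden = rng.sample(observed, n_hide)
        masked = [row[:] for row in m]
        for i, j in hidden:
            masked[i][j] = None              # simulate extra missing values
        for name, impute in methods.items():
            est = impute(masked)
            mse = sum((est[i][j] - m[i][j]) ** 2 for i, j in hidden) / n_hide
            scores[name] += math.sqrt(mse) / rounds
    return min(scores, key=scores.get), scores

# Toy data: each gene (row) is flat across samples, so the row mean should win.
data = [[float(i)] * 6 for i in range(8)]
data[0][1] = None
data[5][3] = None
best, scores = self_training_select(
    data, {"row_mean": row_mean_impute, "col_mean": col_mean_impute})
```

Each round re-runs every candidate method on a freshly masked copy of the data, so the cost grows with the number of methods and rounds. That is consistent with the increased computational cost noted for STS.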