Chiu Chia-Chun, Chan Shih-Yao, Wang Chung-Ching, Wu Wei-Sheng
BMC Syst Biol. 2013;7 Suppl 6(Suppl 6):S12. doi: 10.1186/1752-0509-7-S6-S12. Epub 2013 Dec 13.
Microarray data usually contain missing values, which arise for various reasons. However, most downstream analyses of microarray data require complete datasets. Accurate algorithms for missing value estimation are therefore needed to improve the performance of microarray data analyses. Although many algorithms have been developed, the choice of the optimal algorithm is still debated. Existing performance comparisons are not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used.
In this paper, we performed a comprehensive comparison using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, we evaluated how different types of microarray datasets affect the performance of each imputation algorithm. Second, we examined whether datasets from different species affect the performance of the algorithms differently. All evaluations were performed using the three types of measures. Our results indicate that the performance of an imputation algorithm depends mainly on the type of dataset and not on the species from which the samples come. In addition to the statistical measure, the two other measures, which have biological meaning, are useful for reflecting the impact of missing value imputation on downstream data analyses. Our study suggests that local-least-squares-based methods are good choices for handling missing values in most microarray datasets.
In this work, we carried out a comprehensive comparison of algorithms for microarray missing value imputation. Based on this comparison, researchers can more easily choose the optimal algorithm for their datasets. Moreover, new imputation algorithms can be compared with existing algorithms using this comparison strategy as a standard protocol. To further assist researchers in dealing with missing values, we built an easy-to-use web-based imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA determines the optimal algorithm for the users' data through a series of simulations, and the imputed results can then be downloaded for downstream data analyses.
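The simulation-based selection described above can be illustrated with a minimal sketch. It is not the MissVIA implementation. It assumes a complete expression matrix (genes in rows, samples in columns), hides a random fraction of entries, imputes them with each candidate, and ranks the candidates by a normalized root mean squared error (NRMSE). The NRMSE definition, the two toy imputers, and all function names and parameters (`pick_best_imputer`, `missing_rate`, `runs`) are assumptions made for illustration; the nine algorithms and three measures of the study are not reproduced here.

```python
import numpy as np

def row_mean_impute(x):
    """Toy imputer: fill each missing entry with the mean of its gene (row)."""
    filled = x.copy()
    row_means = np.nanmean(x, axis=1)
    # Fall back to the global mean if a whole row happens to be missing.
    row_means = np.where(np.isnan(row_means), np.nanmean(x), row_means)
    rows, cols = np.where(np.isnan(x))
    filled[rows, cols] = row_means[rows]
    return filled

def zero_impute(x):
    """Toy imputer: replace every missing entry with zero."""
    return np.nan_to_num(x, nan=0.0)

def nrmse(truth, imputed, mask):
    """Assumed statistical measure: RMSE over the hidden entries,
    normalized by the standard deviation of their true values."""
    diff = imputed[mask] - truth[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(truth[mask])))

def pick_best_imputer(complete, imputers, missing_rate=0.05, runs=10, seed=0):
    """Hide entries at random, impute, score, and average over several runs.

    `complete` is assumed to contain no missing values, so that the true
    value of every hidden entry is known.
    """
    rng = np.random.default_rng(seed)
    scores = {name: [] for name in imputers}
    for _ in range(runs):
        mask = rng.random(complete.shape) < missing_rate
        corrupted = complete.copy()
        corrupted[mask] = np.nan
        for name, impute in imputers.items():
            scores[name].append(nrmse(complete, impute(corrupted), mask))
    mean_scores = {name: float(np.mean(v)) for name, v in scores.items()}
    best = min(mean_scores, key=mean_scores.get)  # lower NRMSE is better
    return best, mean_scores
```

A call such as `pick_best_imputer(matrix, {"row_mean": row_mean_impute, "zero": zero_impute})` returns the name of the lowest-scoring candidate together with the mean score of each. Averaging over repeated runs reduces the dependence of the ranking on any single random pattern of hidden entries.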