Li Huihui, Zhao Changbo, Shao Fengfeng, Li Guo-Zheng, Wang Xiao
BMC Genomics. 2015;16 Suppl 9(Suppl 9):S1. doi: 10.1186/1471-2164-16-S9-S1. Epub 2015 Aug 17.
Missing data are inevitable in gene expression microarray experiments because of instrument failure or human error, and they degrade the performance of downstream analysis; most existing analysis approaches are affected by this prevalent problem. Imputation is one of the most frequently used ways of handling missing data, and many methods for estimating missing values have been developed. The remaining challenge is to improve imputation accuracy for data with a large missing rate.
In this paper, inspired by the idea of collaborative training, we propose a novel hybrid imputation method called Recursive Mutual Imputation (RMI). RMI exploits both the global correlation information and the local structure in the data, captured by two popular methods, Bayesian Principal Component Analysis (BPCA) and Local Least Squares (LLS), respectively. The mutual strategy is implemented by sharing the estimated data sequences at each recursion. In addition, the order of imputation is determined by the number of missing entries in the target gene. Finally, a weight-based integration method is used in the final assembling step.
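The overall flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: a truncated SVD reconstruction stands in for BPCA, a nearest-neighbour least-squares regression stands in for LLS, and the function names, the row-mean initial fill, the "fewest missing entries first" ordering, the fixed number of rounds, and the fixed combination weight are all assumptions made for illustration. Every gene is assumed to have at least one observed value.

```python
import numpy as np

def global_estimate(filled, rank=2):
    """Stand-in for BPCA (assumption): rank-`rank` SVD reconstruction."""
    mu = filled.mean(axis=0)
    u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank] + mu

def local_estimate(filled, mask, row, k=3):
    """Stand-in for LLS (assumption): regress the target gene on its k
    nearest genes, using only the columns observed in the target gene."""
    obs, miss = ~mask[row], mask[row]
    others = np.delete(np.arange(filled.shape[0]), row)
    dist = np.linalg.norm(filled[others][:, obs] - filled[row, obs], axis=1)
    nbrs = others[np.argsort(dist)[:k]]
    coef, *_ = np.linalg.lstsq(filled[nbrs][:, obs].T, filled[row, obs], rcond=None)
    return filled[nbrs][:, miss].T @ coef

def mutual_impute(data, n_rounds=5, weight=0.5, rank=2, k=3):
    """Hypothetical mutual-imputation loop; rows are genes, NaN marks missing."""
    mask = np.isnan(data)
    row_means = np.nanmean(data, axis=1, keepdims=True)
    est_global = np.where(mask, row_means, data)  # initial fill: gene means
    est_local = est_global.copy()
    # Assumed ordering: genes with fewer missing entries are imputed first.
    counts = mask.sum(axis=1)
    order = [r for r in np.argsort(counts, kind="stable") if counts[r] > 0]
    for _ in range(n_rounds):
        # Each learner starts from the matrix completed by the other one.
        new_global = np.where(mask, global_estimate(est_local, rank), data)
        work = est_global.copy()
        for r in order:
            work[r, mask[r]] = local_estimate(work, mask, r, k)
        est_global, est_local = new_global, work
    # Assumed final step: fixed-weight average of the two sets of estimates.
    combined = weight * est_global + (1.0 - weight) * est_local
    return np.where(mask, combined, data)

# Small synthetic example: 8 genes x 6 samples with four missing entries.
rng = np.random.default_rng(0)
demo = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
for gene, sample in [(0, 1), (2, 3), (2, 4), (5, 0)]:
    demo[gene, sample] = np.nan
completed = mutual_impute(demo)
```

Observed entries are returned unchanged; only the positions that were NaN receive the weighted estimates.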
We compare RMI with three state-of-the-art algorithms, BPCA, LLS, and Iterated Local Least Squares imputation (ItrLLS), on four publicly available microarray datasets. The experimental results show that RMI significantly outperforms the compared methods in terms of Normalized Root Mean Square Error (NRMSE), especially for datasets with large missing rates and fewer complete genes.
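The abstract does not spell out the NRMSE formula. A definition commonly used in the microarray imputation literature takes the root mean square error over the artificially removed entries and divides it by the standard deviation of the true values at those entries. The Python sketch below assumes that definition; the function name and argument names are illustrative.

```python
import numpy as np

def nrmse(true_values, imputed_values, missing_mask):
    """RMS error over the missing entries, normalised by the standard
    deviation of the true values at those entries (assumed definition)."""
    truth = np.asarray(true_values, dtype=float)[missing_mask]
    guess = np.asarray(imputed_values, dtype=float)[missing_mask]
    return float(np.sqrt(np.mean((guess - truth) ** 2) / np.var(truth)))
```

Under this definition a value of 0 means perfect recovery, and replacing every missing entry with the mean of the true missing values gives exactly 1, so lower values indicate better imputation.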
Our proposed hybrid imputation approach incorporates both global and local information of microarray genes and achieves lower NRMSE values than any single approach alone. This study also highlights the need for imputation methods to consider the order in which missing entries are imputed.