Brás Lígia P, Menezes José C
Centre for Chemical & Biological Engineering, Department of Chemical and Biological Engineering, IST, Technical University of Lisbon, Av. Rovisco Pais, P-1049-001 Lisbon, Portugal.
Biomol Eng. 2007 Jun;24(2):273-82. doi: 10.1016/j.bioeng.2007.04.003. Epub 2007 Apr 19.
We present a modification of the weighted K-nearest neighbours imputation method (KNNimpute) for missing value (MV) estimation in microarray data, based on the reuse of estimated data. The method is called iterative KNN imputation (IKNNimpute) because the estimation is performed iteratively, using the most recently estimated values. The estimation efficiency of IKNNimpute was assessed under different conditions (data type, fraction and structure of missing data) using the normalized root mean squared error (NRMSE) and the correlation coefficients between estimated and true values, and was compared with that of other cluster-based estimation methods (KNNimpute and sequential KNN). We further investigated the influence of imputation on the detection of differentially expressed genes with SAM by examining the differentially expressed genes that are lost after MV estimation. The performance measures give consistent results. They indicate that the iterative procedure of IKNNimpute can enhance the prediction ability of cluster-based methods at high missing rates, in non-time series experiments, and in data sets comprising both time series and non-time series data, because the information in genes having MVs is used more efficiently and the iterative procedure refines the MV estimates. More importantly, IKNNimpute has a smaller detrimental effect on the detection of differentially expressed genes.
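The iterative idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the initial guess, distance measure, weighting scheme or stopping rule, so the row-mean initialisation, Euclidean-type distance, inverse-distance weights, fixed iteration count `n_iter`, and all function names here are assumptions made for illustration. The NRMSE shown normalises by the standard deviation of the true values, which is one common definition and may differ from the one used in the paper. The sketch also assumes every gene (row) has at least one observed value.

```python
import numpy as np


def weighted_knn_estimate(filled, target_row, missing_cols, k):
    """Estimate the missing entries of one row from its k nearest rows.

    `filled` is the current complete matrix (observed values plus current
    estimates); `missing_cols` is a boolean mask over the columns.
    """
    observed_cols = ~missing_cols
    # Distance over the columns that are observed in the target row.
    # Other rows contribute their current estimates, which is the
    # "reuse of estimated data" assumed in this sketch.
    diffs = filled[:, observed_cols] - filled[target_row, observed_cols]
    dists = np.sqrt((diffs ** 2).mean(axis=1))
    dists[target_row] = np.inf  # never select the row itself
    neighbours = np.argsort(dists)[:k]
    # Inverse-distance weights (assumed weighting scheme).
    weights = 1.0 / (dists[neighbours] + 1e-12)
    weights /= weights.sum()
    return weights @ filled[neighbours][:, missing_cols]


def iterative_knn_impute(data, k=3, n_iter=5):
    """Fill NaNs in a genes-by-samples matrix by repeated weighted KNN."""
    data = np.asarray(data, dtype=float)
    mask = np.isnan(data)
    # Initial guess: each gene's mean over its observed samples (assumption).
    row_means = np.nanmean(data, axis=1, keepdims=True)
    filled = np.where(mask, row_means, data)
    rows_with_mv = np.flatnonzero(mask.any(axis=1))
    for _ in range(n_iter):
        updated = filled.copy()
        for i in rows_with_mv:
            # Re-estimate from the matrix produced by the previous pass.
            updated[i, mask[i]] = weighted_knn_estimate(filled, i, mask[i], k)
        filled = updated
    return filled


def nrmse(estimated, true, mask):
    """RMSE over the masked entries, divided by the std of the true values."""
    err = estimated[mask] - true[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(true[mask])
```

To evaluate such a sketch, one would take a complete expression matrix, set a chosen fraction of entries to NaN, call `iterative_knn_impute`, and pass the result, the original matrix and the NaN mask to `nrmse`.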