Department of Statistics, Begum Rokeya University, Rangpur, 5400, Bangladesh.
Queensland Brain Institute, The University of Queensland, Brisbane, QLD 4072, Australia.
Comput Biol Med. 2021 Nov;138:104911. doi: 10.1016/j.compbiomed.2021.104911. Epub 2021 Sep 29.
Transcriptomics and metabolomics data often contain missing values or outliers due to limitations of the data acquisition techniques. Most statistical methods require complete datasets for downstream analysis. A number of missing value imputation methods have been developed using the classical mean and variance based on maximum likelihood estimators, which are not robust against outliers. Consequently, the performance of these methods deteriorates in the presence of outliers. Precise imputation of missing values and handling of outliers therefore need to be addressed together. In this paper, we develop a robust iterative approach using robust estimators based on the minimum beta divergence method, which simultaneously imputes missing values and replaces outliers. We investigate the performance of the proposed method, through feature selection on both simulated and real datasets, in comparison with six frequently used missing value imputation methods: Zero, KNN, robust SVD, EM, random forest (RF) and the weighted least square approach (WLSA). Ten performance indices were used to identify the best method: Frobenius norm (FOBN), accuracy (ACC), sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), detection rate (DR), misclassification error rate (MER), the area under the ROC curve (AUC) and computational runtime. Evaluation based on both simulated and real data suggests that the proposed method outperforms the other traditional methods across various rates of outliers and missing values. In the absence of outliers, the proposed approach performs almost as well as the other methods. The proposed method is accurate, simple, and requires less computational time than the other methods. We therefore recommend the proposed procedure for large-scale transcriptomics and metabolomics data analysis.
The computational tool has been implemented in an R package, which is publicly available from https://CRAN.R-project.org/package=rMisbeta.
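To make the idea of a beta-divergence-based robust estimator concrete, the following is a minimal Python sketch under stated assumptions. It is not the authors' algorithm and not the rMisbeta API. It applies the standard beta-weighted fixed-point updates for a univariate Gaussian mean and variance to each feature (row) separately, then replaces missing entries and low-weight entries with the robust mean. The function names `beta_mean_var` and `robust_impute`, the defaults `beta=0.2` and `weight_cutoff=0.2`, and the row-wise treatment are all illustrative assumptions.

```python
import numpy as np


def beta_mean_var(x, beta=0.2, n_iter=100, tol=1e-8):
    """Robust mean and variance of a 1-D sample via beta-weighted updates.

    Hypothetical helper: NaNs are ignored, and each observation is
    down-weighted by exp(-beta/2 * squared standardized distance).
    """
    x = x[~np.isnan(x)]
    # Robust starting values: median and scaled median absolute deviation.
    mu = np.median(x)
    mad = np.median(np.abs(x - mu))
    var = max((1.4826 * mad) ** 2, 1e-12)
    for _ in range(n_iter):
        w = np.exp(-0.5 * beta * (x - mu) ** 2 / var)
        mu_new = np.sum(w * x) / np.sum(w)
        # The (1 + beta) factor offsets the shrinkage caused by the weights.
        var_new = (1.0 + beta) * np.sum(w * (x - mu) ** 2) / np.sum(w)
        var_new = max(var_new, 1e-12)
        converged = abs(mu_new - mu) < tol and abs(var_new - var) < tol
        mu, var = mu_new, var_new
        if converged:
            break
    return mu, var


def robust_impute(X, beta=0.2, weight_cutoff=0.2):
    """Replace NaNs and low-weight entries in each row with the robust row mean.

    Assumption for illustration: rows are features, columns are samples,
    and an entry whose beta-weight falls below `weight_cutoff` is an outlier.
    """
    X = np.array(X, dtype=float)  # work on a copy
    for i in range(X.shape[0]):
        row = X[i]  # view into X, so assignments below modify X
        missing = np.isnan(row)
        if missing.all():
            continue  # nothing to estimate from
        mu, var = beta_mean_var(row, beta)
        with np.errstate(invalid="ignore"):
            w = np.exp(-0.5 * beta * (row - mu) ** 2 / var)
            outlier = ~missing & (w < weight_cutoff)
        row[missing | outlier] = mu
    return X


# Usage: one feature with a missing value and one gross outlier.
X = np.array([[5.1, 4.9, 5.0, np.nan, 100.0, 5.2, 4.8, 5.0]])
X_imputed = robust_impute(X)
```

With these defaults, an entry is flagged once it lies roughly four robust standard deviations from the robust mean; a smaller `beta` moves the estimates toward the classical mean and variance, which is the non-robust case the abstract contrasts against.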