University of Hawaii Cancer Center, Honolulu, HI, 96813, USA.
Department of Molecular Biosciences and Bioengineering, University of Hawaii at Manoa, Honolulu, HI, 96822, USA.
Sci Rep. 2018 Jan 12;8(1):663. doi: 10.1038/s41598-017-19120-0.
Missing values are widespread in mass-spectrometry (MS) based metabolomics data. Various methods have been applied to handle them, but the choice of method can significantly affect downstream data analyses. Missing values are typically classified into three types: missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for the different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and the NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis was used to evaluate the overall sample distribution. Student's t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed best for MCAR/MAR and that QRILC was the favored method for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a publicly accessible web tool for applying missing value imputation in metabolomics ( https://metabolomics.cc.hawaii.edu/software/MetImp/ ).
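As a rough illustration of the evaluation idea described in the abstract, the sketch below applies the four simplest imputation rules named there (zero, half minimum, mean, median) to a single metabolite and scores each with an NRMSE. The abstract does not give the NRMSE formula, so the definition used here (root of the mean squared error over the masked entries, divided by the population variance of the true masked values) is an assumption, as are the function names `impute_column` and `nrmse` and the toy intensities. It is not the authors' implementation and does not cover RF, SVD, kNN, or QRILC.

```python
import math
import statistics


def impute_column(values, method):
    """Fill None entries of one metabolite with a single constant (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    if method == "zero":
        fill = 0.0
    elif method == "half_min":
        fill = min(observed) / 2.0  # half of the smallest observed value
    elif method == "mean":
        fill = statistics.fmean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


def nrmse(true_values, imputed_values, missing_mask):
    """Assumed definition: sqrt(MSE / variance of truth), over masked entries only."""
    truth = [t for t, m in zip(true_values, missing_mask) if m]
    guess = [g for g, m in zip(imputed_values, missing_mask) if m]
    mse = statistics.fmean((t - g) ** 2 for t, g in zip(truth, guess))
    return math.sqrt(mse / statistics.pvariance(truth))


# Toy complete data for one metabolite (made-up intensities).
true_values = [4.0, 2.0, 8.0, 6.0, 10.0]
# Mask the two lowest values to mimic left-censored (MNAR) missingness.
missing_mask = [True, True, False, False, False]
with_missing = [None if m else v for v, m in zip(true_values, missing_mask)]

scores = {
    method: nrmse(true_values, impute_column(with_missing, method), missing_mask)
    for method in ("zero", "half_min", "mean", "median")
}
```

On this toy column the half-minimum rule scores lowest, which is only an artifact of how the values were masked; it says nothing about the study's results. In a comparison like the one described, such scores would be computed per dataset and per missingness type and then ranked.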