School of Biological Sciences, Nanyang Technological University, Singapore.
Department of Computer Science, National University of Singapore, Singapore.
Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad233.
Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data. We rigorously tested ProJect on various high-throughput data types, including genomics and mass spectrometry (MS)-based proteomics. Specifically, we utilized renal cancer (RC) data acquired using DIA-SWATH, ovarian cancer (OC) data acquired using DIA-MS, and bladder (BladderBatch) and glioblastoma (GBM) microarray gene expression datasets. Our results demonstrate that ProJect consistently performs better than the other referenced MVI methods. It achieves the lowest normalized root mean square error (on average, 45.92% less error in RC_C, 27.37% in RC_full, 29.22% in OC, 23.65% in BladderBatch and 20.20% in GBM relative to the closest competing method) and the lowest Procrustes sum of squared errors (Procrustes SS; 79.71% less error in RC_C, 38.36% in RC_full, 18.13% in OC, 74.74% in BladderBatch and 30.79% in GBM compared to the next best method). ProJect also attains the highest correlation coefficient across all types of MV combinations (0.64% higher in RC_C, 0.24% in RC_full, 0.55% in OC, 0.39% in BladderBatch and 0.27% in GBM versus the second-best performing method). ProJect's key strength is its ability to handle the different types of MVs commonly found in real-world data. Unlike most MVI methods, which are designed to handle only one type of MV, ProJect employs a decision-making algorithm that first determines whether an MV is missing at random or missing not at random. It then applies a targeted imputation strategy to each MV type, resulting in more accurate and reliable imputation outcomes.
An R implementation of ProJect is available at https://github.com/miaomiao6606/ProJect.
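The two-step idea described in the abstract (first decide whether an MV is missing at random or missing not at random, then impute accordingly) and the NRMSE evaluation can be illustrated with a minimal Python sketch. This is not ProJect's algorithm: the abstract does not specify its decision rule or imputation models, so the low-abundance quantile rule, the minimum-value and row-mean imputations, and the names `toy_hybrid_impute` and `nrmse` are all assumptions made for illustration. The NRMSE shown is one common definition (root of the mean squared error over masked entries divided by the variance of their true values) and may differ from the one used in the paper.

```python
import numpy as np


def nrmse(truth, imputed, mask):
    """One common NRMSE: sqrt(MSE over masked entries / variance of their true values)."""
    diff = imputed[mask] - truth[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(truth[mask])))


def toy_hybrid_impute(x, low_quantile=0.25):
    """Toy two-branch imputation (hypothetical rule, NOT ProJect's method).

    x: 2-D array, rows = features, columns = samples, NaN = missing.
    Assumes every row has at least one observed value.
    """
    filled = x.copy()
    row_means = np.nanmean(x, axis=1)               # observed mean per feature
    cutoff = np.quantile(row_means, low_quantile)   # "low abundance" threshold
    global_min = np.nanmin(x)                       # smallest observed value
    for i in range(x.shape[0]):
        missing = np.isnan(x[i])
        if not missing.any():
            continue
        if row_means[i] <= cutoff:
            # Assumed MNAR (left-censored): fill with the smallest observed value.
            filled[i, missing] = global_min
        else:
            # Assumed MAR: fill with the feature's observed mean.
            filled[i, missing] = row_means[i]
    return filled


# Simulate MVs on a complete toy matrix, impute, and score against the truth.
truth = np.array([
    [1.0, 2.0, 3.0, 2.0],
    [10.0, 11.0, 12.0, 11.0],
    [20.0, 22.0, 21.0, 23.0],
    [30.0, 29.0, 31.0, 30.0],
])
observed = truth.copy()
observed[0, 0] = np.nan   # missing value in a low-abundance feature
observed[2, 1] = np.nan   # missing value in a high-abundance feature
mask = np.isnan(observed)

filled = toy_hybrid_impute(observed)
print("NRMSE:", nrmse(truth, filled, mask))
```

Masking known values and scoring the imputed entries against them mirrors the kind of benchmark the reported NRMSE figures imply; a real evaluation would use the datasets and MV simulation scheme described in the paper.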