Li Qian, Fisher Kate, Meng Wenjun, Fang Bin, Welsh Eric, Haura Eric B, Koomen John M, Eschrich Steven A, Fridley Brooke L, Chen Y Ann
Health Informatics Institute, University of South Florida, Tampa, FL, USA.
Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA.
Bioinformatics. 2020 Jan 1;36(1):257-263. doi: 10.1093/bioinformatics/btz488.
Missingness in label-free mass spectrometry is inherent to the technology, so computational approaches that recover missing values in metabolomics and proteomics datasets are important. Most existing methods are designed under one particular assumption: values are either missing at random or missing because they fall below the detection limit. If the actual missing pattern deviates from that assumption, imputation may yield biased results. Hence, we investigate the missing patterns in label-free mass spectrometry data and develop an omnibus approach, GMSimpute, that allows effective imputation across different missing patterns.
Three proteomics datasets and one metabolomics dataset indicate that missing values can be a mixture of abundance-dependent and abundance-independent missingness. We assess the performance of GMSimpute using simulated data (covering a wide range of 80 missing patterns) and metabolomics data from The Cancer Genome Atlas breast cancer and clear cell renal cell carcinoma studies. Using Pearson correlation and normalized root mean square error between the true and imputed abundances, we compare its performance with K-nearest-neighbor-type approaches, Random Forest, GSimp, a model-based method implemented in DanteR, and minimum-value imputation. The results indicate that GMSimpute provides higher imputation accuracy and performs stably across different missing patterns. In addition, when applied to The Cancer Genome Atlas datasets, GMSimpute identifies features in downstream differential expression analysis with high accuracy.
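The evaluation scheme described above (mask known values, impute them, then score with Pearson correlation and normalized root mean square error) can be illustrated with a minimal Python sketch. It does not implement GMSimpute and is not the authors' code. Everything in it is an assumption made for illustration: the function names, the simulation settings, the use of per-feature minimum imputation as the baseline, and the choice to normalize RMSE by the standard deviation of the true values (the abstract does not state which normalization is used).

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def nrmse(true, imputed):
    """RMSE divided by the standard deviation of the true values (assumed normalization)."""
    n = len(true)
    mean_t = sum(true) / n
    var_t = sum((t - mean_t) ** 2 for t in true) / n
    mse = sum((t - i) ** 2 for t, i in zip(true, imputed)) / n
    return math.sqrt(mse / var_t)


def simulate(n_features=50, n_samples=20, seed=0):
    """Hypothetical log-abundance matrix: one row per feature, one column per sample."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_features):
        mu = rng.gauss(20.0, 2.0)  # feature-level mean abundance
        data.append([rng.gauss(mu, 0.5) for _ in range(n_samples)])
    return data


def mask_mixture(data, p_independent=0.1, low_quantile=0.2, p_dependent=0.5, seed=1):
    """Pick cells to hide using a mixture of two mechanisms:
    abundance-independent (every cell, probability p_independent) and
    abundance-dependent (cells at or below a low quantile, probability p_dependent)."""
    rng = random.Random(seed)
    flat = sorted(v for row in data for v in row)
    cutoff = flat[int(low_quantile * (len(flat) - 1))]
    masked = set()
    for i, row in enumerate(data):
        for j, v in enumerate(row):
            drop_random = rng.random() < p_independent
            drop_low = v <= cutoff and rng.random() < p_dependent
            if drop_random or drop_low:
                masked.add((i, j))
    return masked


def impute_row_min(data, masked):
    """Baseline: fill each hidden cell with the minimum observed value of its feature."""
    global_min = min(v for i, row in enumerate(data)
                     for j, v in enumerate(row) if (i, j) not in masked)
    imputed = {}
    for i, row in enumerate(data):
        observed = [v for j, v in enumerate(row) if (i, j) not in masked]
        fill = min(observed) if observed else global_min
        for j in range(len(row)):
            if (i, j) in masked:
                imputed[(i, j)] = fill
    return imputed


def evaluate(data, imputed):
    """Score imputed cells against the hidden truth: (Pearson r, NRMSE)."""
    keys = sorted(imputed)
    true = [data[i][j] for i, j in keys]
    est = [imputed[k] for k in keys]
    return pearson(true, est), nrmse(true, est)


if __name__ == "__main__":
    data = simulate()
    masked = mask_mixture(data)
    r, err = evaluate(data, impute_row_min(data, masked))
    print(f"masked cells: {len(masked)}, Pearson r: {r:.3f}, NRMSE: {err:.3f}")
```

Varying `p_independent`, `low_quantile`, and `p_dependent` produces different mixtures of the two missingness mechanisms, which is the kind of range of missing patterns a method comparison would need to cover.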
GMSimpute is on CRAN: https://cran.r-project.org/web/packages/GMSimpute/index.html.
Supplementary data are available at Bioinformatics online.