DIMA: Data-Driven Selection of an Imputation Algorithm.

Affiliations

Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany.

Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany.

Publication information

J Proteome Res. 2021 Jul 2;20(7):3489-3496. doi: 10.1021/acs.jproteome.1c00119. Epub 2021 Jun 1.

DOI:10.1021/acs.jproteome.1c00119

PMID:34062065

Abstract

Imputation is a prominent strategy when dealing with missing values (MVs) in proteomics data analysis pipelines. However, the performance of different imputation methods is difficult to assess and varies strongly depending on data characteristics. To overcome this issue, we present the concept of a data-driven selection of an imputation algorithm (DIMA). The performance and broad applicability of DIMA are demonstrated on 142 quantitative proteomics data sets from the PRoteomics IDEntifications (PRIDE) database and on simulated data consisting of 5-50% MVs with different proportions of missing not at random and missing completely at random values. DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 80% of the cases. The DIMA implementation is available in MATLAB at github.com/kreutz-lab/OmicsData and in R at github.com/kreutz-lab/DIMAR.
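The general idea of a data-driven selection can be illustrated with a minimal Python sketch. It is not the authors' MATLAB or R implementation: the function names (`mask_values`, `select_imputation`, `impute_row_mean`, `impute_global_min`), the two simple candidate imputers, and the masking scheme are all assumptions chosen for illustration. The sketch hides known values in a complete matrix with a mixture of intensity-dependent (MNAR-like) and uniformly random (MCAR) missingness, imputes them with each candidate, and picks the candidate with the lowest RMSE on the hidden entries.

```python
import numpy as np


def rmse(truth, estimate, mask):
    """Root mean square error, evaluated only on the artificially hidden entries."""
    diff = truth[mask] - estimate[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def impute_row_mean(x):
    """Hypothetical candidate: fill each NaN with the mean of its row's observed values."""
    out = x.copy()
    observed = ~np.isnan(out)
    counts = observed.sum(axis=1)
    sums = np.nansum(out, axis=1)
    # Rows without any observed value fall back to the global mean.
    row_means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nanmean(out))
    rows, _ = np.where(~observed)
    out[~observed] = row_means[rows]
    return out


def impute_global_min(x):
    """Hypothetical candidate: fill each NaN with the smallest observed value."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmin(out)
    return out


def mask_values(x, frac, mnar_share, rng):
    """Boolean mask hiding `frac` of all entries (simplified, assumed scheme).

    A share `mnar_share` of the hidden entries are the lowest values (MNAR-like);
    the remainder is drawn uniformly at random (MCAR).
    """
    n_total = int(round(frac * x.size))
    n_mnar = int(round(mnar_share * n_total))
    flat = np.zeros(x.size, dtype=bool)
    flat[np.argsort(x, axis=None)[:n_mnar]] = True
    remaining = np.flatnonzero(~flat)
    flat[rng.choice(remaining, size=n_total - n_mnar, replace=False)] = True
    return flat.reshape(x.shape)


def select_imputation(x_complete, candidates, frac=0.2, mnar_share=0.5, seed=0):
    """Return the candidate name with the lowest RMSE, plus all RMSE scores."""
    rng = np.random.default_rng(seed)
    mask = mask_values(x_complete, frac, mnar_share, rng)
    x_masked = x_complete.copy()
    x_masked[mask] = np.nan
    scores = {name: rmse(x_complete, fn(x_masked), mask)
              for name, fn in candidates.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Synthetic log-intensities: 50 proteins x 8 samples with protein-specific levels.
    rng = np.random.default_rng(1)
    data = rng.normal(20.0, 2.0, size=(50, 1)) + rng.normal(0.0, 0.3, size=(50, 8))
    candidates = {"row_mean": impute_row_mean, "global_min": impute_global_min}
    best, scores = select_imputation(data, candidates)
    print(best, scores)
```

Which candidate wins depends on the data and on the assumed MV fraction and MNAR share, which mirrors the abstract's point that imputation performance varies strongly with data characteristics.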
