Wryk Grzegorz, Gawor Andrzej, Bulska Ewa
Faculty of Physics, University of Warsaw, Pasteura 5, 02-093 Warsaw, Poland.
Biological and Chemical Research Centre, Faculty of Chemistry, University of Warsaw, Zwirki i Wigury 101, 02-089 Warsaw, Poland.
Int J Mol Sci. 2024 Dec 17;25(24):13491. doi: 10.3390/ijms252413491.
Mass-spectrometry-based proteomics frequently utilizes label-free quantification strategies due to their cost-effectiveness, methodological simplicity, and capability to identify large numbers of proteins within a single analytical run. Despite these advantages, the prevalence of missing values (MV), which can affect up to 50% of the data matrix, poses a significant challenge by reducing the accuracy, reproducibility, and interpretability of the results. Consequently, effective handling of missing values is crucial for reliable quantitative analysis in proteomic studies. This study systematically evaluated the performance of selected imputation methods for addressing missing values in proteomic datasets. Two protein identification algorithms, FragPipe and MaxQuant, were employed to generate the datasets, enabling an assessment of their influence on imputation efficacy. Ten imputation methods representing three methodological categories were analyzed: single-value (LOD, ND, SampMin), local-similarity (kNN, LLS, RF), and global-similarity (LSA, BPCA, PPCA, SVD) approaches. The study also investigated the impact of data logarithmization on imputation performance. The evaluation was conducted in two stages. First, performance metrics including normalized root mean square error (NRMSE) and the area under the receiver operating characteristic (ROC) curve (AUC) were applied to datasets with artificially introduced missing values. The datasets were designed to mimic varying MV rates (10%, 25%, 50%) and proportions of values missing not at random (MNAR) (0%, 20%, 40%, 80%, 100%). This step enabled an assessment of how data characteristics affect the relative effectiveness of the imputation methods. Second, the imputation strategies were applied to real proteomic datasets containing natural missing values, focusing on the true-positive (TP) classification of proteins to evaluate their practical utility.
The findings highlight that local-similarity-based methods, particularly random forest (RF) and local least-squares (LLS), consistently exhibit robust performance across varying MV scenarios. Furthermore, data logarithmization significantly enhances the effectiveness of global-similarity methods, suggesting it as a beneficial preprocessing step prior to imputation. The study underscores the importance of tailoring imputation strategies to the specific characteristics of the data to maximize the reliability of label-free quantitative proteomics. Interestingly, while the choice of protein identification algorithm (FragPipe vs. MaxQuant) had minimal influence on the overall imputation error, differences in the number of proteins classified as true positives revealed more nuanced effects, emphasizing the interplay between imputation strategies and downstream analysis outcomes. These findings provide a comprehensive framework for improving the accuracy and reproducibility of proteomic analyses through an informed selection of imputation approaches.
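The first evaluation stage described above (masking a complete matrix at a chosen MV rate and MNAR proportion, imputing, then scoring with NRMSE) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the study: the function names `introduce_missing`, `impute_sample_min`, and `nrmse` are hypothetical, the data are simulated, MNAR is approximated by masking the lowest intensities, SampMin is taken to mean the per-sample (column) minimum, and NRMSE is normalized by the variance of the true masked values, which is only one of several definitions in use.

```python
import numpy as np


def introduce_missing(X, mv_rate, mnar_fraction, rng):
    """Mask a complete matrix (proteins x samples).

    Assumption: MNAR cells are the lowest-intensity values overall;
    the remaining masked cells are drawn uniformly at random.
    """
    X = np.asarray(X, dtype=float)
    n_missing = int(round(mv_rate * X.size))
    n_mnar = int(round(mnar_fraction * n_missing))
    n_random = n_missing - n_mnar

    flat_mask = np.zeros(X.size, dtype=bool)
    if n_mnar > 0:
        lowest = np.argsort(X.ravel(), kind="stable")[:n_mnar]
        flat_mask[lowest] = True
    if n_random > 0:
        remaining = np.flatnonzero(~flat_mask)
        flat_mask[rng.choice(remaining, size=n_random, replace=False)] = True

    mask = flat_mask.reshape(X.shape)
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def impute_sample_min(X_missing):
    """Replace each missing value with the minimum observed in its sample (column)."""
    X_imputed = X_missing.copy()
    col_min = np.nanmin(X_missing, axis=0)
    rows, cols = np.nonzero(np.isnan(X_missing))
    X_imputed[rows, cols] = col_min[cols]
    return X_imputed


def nrmse(X_true, X_imputed, mask):
    """RMSE over the masked cells, normalized by the variance of their true values."""
    diff = X_imputed[mask] - X_true[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(X_true[mask])))


# Simulated log2 intensities: 200 proteins x 6 samples (illustrative only).
rng = np.random.default_rng(0)
X_true = rng.normal(loc=25.0, scale=2.0, size=(200, 6))

for mv_rate in (0.10, 0.25, 0.50):
    for mnar_fraction in (0.0, 0.2, 0.4, 0.8, 1.0):
        X_missing, mask = introduce_missing(X_true, mv_rate, mnar_fraction, rng)
        score = nrmse(X_true, impute_sample_min(X_missing), mask)
        print(f"MV={mv_rate:.0%} MNAR={mnar_fraction:.0%} NRMSE={score:.3f}")
```

Swapping `impute_sample_min` for another imputer with the same input and output shape would allow a like-for-like comparison across the same grid of MV rates and MNAR proportions.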