Computing & Analytics Division, Pacific Northwest National Laboratory, Richland, Washington 99354, United States.
Boeing, Seattle, Washington 98055, United States.
J Proteome Res. 2021 Jan 1;20(1):1-13. doi: 10.1021/acs.jproteome.0c00123. Epub 2020 Sep 25.
The throughput efficiency and increased depth of coverage provided by isobaric-labeled proteomics measurements have led to wider use of these techniques. However, the structure of missing data differs from that in unlabeled studies, which motivates this review comparing the efficacy of nine imputation methods on large isobaric-labeled proteomics data sets to guide researchers in choosing an appropriate method. Imputation methods were evaluated by accuracy, statistical hypothesis test inference, and run time. In general, expectation maximization and random forest imputation methods yielded the best performance, whereas constant-based methods consistently performed poorly across all data set sizes and percentages of missing values. For data sets with small sample sizes and higher percentages of missing data, the results indicate that statistical inference with no imputation may be preferable. On the basis of these findings, a core set of imputation methods performs better for isobaric-labeled proteomics data, but for data sets composed of a small number of samples, careful consideration should be given to whether imputation is the optimal strategy at all.
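The accuracy criterion mentioned above can be illustrated with a small masking experiment: hide known values, impute them, and measure the error on the hidden entries. The sketch below is not the procedure used in the review. The simulated log2 abundance matrix, the 20% missing-at-random rate, the helper names, and the two toy imputers (a constant fill and a per-protein mean) are all assumptions chosen for illustration; the abstract does not name the nine methods beyond expectation maximization, random forest, and constant-based approaches.

```python
import numpy as np


def mask_at_random(data, frac, rng):
    """Return a copy of `data` with roughly `frac` of entries set to NaN, plus the mask."""
    mask = rng.random(data.shape) < frac
    masked = data.copy()
    masked[mask] = np.nan
    return masked, mask


def impute_constant(data, value=0.0):
    """Constant-based imputation: replace every missing entry with `value`."""
    out = data.copy()
    out[np.isnan(out)] = value
    return out


def impute_row_mean(data):
    """Replace each missing entry with the mean of the observed values in its row."""
    out = data.copy()
    row_means = np.nanmean(out, axis=1, keepdims=True)
    # Fall back to the global mean for any row that is entirely missing.
    row_means = np.where(np.isnan(row_means), np.nanmean(out), row_means)
    missing = np.isnan(out)
    out[missing] = np.broadcast_to(row_means, out.shape)[missing]
    return out


def rmse(truth, imputed, mask):
    """Root-mean-square error computed only over the entries that were hidden."""
    diff = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)

# Hypothetical data: 200 proteins (rows) x 12 samples (columns) of log2 abundances.
protein_means = rng.normal(20.0, 2.0, size=(200, 1))
truth = protein_means + rng.normal(0.0, 0.5, size=(200, 12))

masked, mask = mask_at_random(truth, 0.2, rng)

results = {
    "constant": rmse(truth, impute_constant(masked), mask),
    "row_mean": rmse(truth, impute_row_mean(masked), mask),
}
for name, err in results.items():
    print(f"{name}: RMSE = {err:.3f}")
```

In this toy setting the constant fill lands far from the true abundance scale, so its error is much larger than that of the row mean. A fuller comparison would also time each imputer and check how downstream hypothesis tests change, which are the other two criteria the abstract lists.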