A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets.

Affiliations

Center for Bioinformatics and Functional Genomics, Department of Biomedical Science, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States.

Graduate Program in Biomedical Sciences, Department of Biomedical Science, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States.

Publication information

J Proteome Res. 2021 Jun 4;20(6):3214-3229. doi: 10.1021/acs.jproteome.1c00070. Epub 2021 May 3.

Abstract

Missing values in proteomic data sets have real consequences for downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (the fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set's most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
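The idea of comparing imputation methods on a small data set before committing to one can be illustrated with a minimal sketch. This is not the authors' workflow: it swaps in a common mask-and-score check (hide some observed values, impute them, measure the error), and it uses two simple stand-in imputers, a per-feature mean and a nearest-neighbor fill, rather than LLS, RF, or BPCA. All function names, the toy intensity matrix, and the 20% masking fraction are assumptions made for illustration.

```python
import math
import random


def mask_values(matrix, fraction, rng):
    """Hide a random fraction of observed entries; return the masked copy and hidden positions."""
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, max(1, int(len(observed) * fraction)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_row_mean(matrix):
    """Stand-in imputer: fill each gap with the mean of the observed values in its row."""
    out = []
    for row in matrix:
        seen = [v for v in row if v is not None]
        fill = sum(seen) / len(seen) if seen else 0.0
        out.append([fill if v is None else v for v in row])
    return out


def impute_knn(matrix, k=2):
    """Stand-in local imputer: fill each gap from the k most similar rows."""
    fallback = impute_row_mean(matrix)
    out = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is not None:
                continue
            neighbors = []
            for r, other in enumerate(matrix):
                if r == i or other[j] is None:
                    continue
                # Mean squared difference over columns observed in both rows.
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    neighbors.append((sum(shared) / len(shared), other[j]))
            neighbors.sort(key=lambda pair: pair[0])
            nearest = [value for _, value in neighbors[:k]]
            out[i][j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
    return out


def rmse(imputed, truth, positions):
    """Root-mean-square error over the positions that were hidden."""
    total = sum((imputed[i][j] - truth[i][j]) ** 2 for i, j in positions)
    return math.sqrt(total / len(positions))


# Hypothetical toy data: six features (rows) in two intensity groups, six samples (columns).
rng = random.Random(0)
truth = [[base + 0.1 * j + rng.gauss(0, 0.05) for j in range(6)]
         for base in (10, 10, 10, 14, 14, 14)]
masked, hidden = mask_values(truth, 0.2, rng)

scores = {name: rmse(impute(masked), truth, hidden)
          for name, impute in (("row_mean", impute_row_mean), ("knn", impute_knn))}
print(scores)
```

Which imputer scores better here depends on the toy data, which is consistent with the abstract's point that the most suitable method depends on the structure of the data set. Masking observed values also presumes the hidden entries resemble the truly missing ones, an assumption the study's dilution series design is meant to avoid.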
