Normics: Proteomic Normalization by Variance and Data-Inherent Correlation Structure.

Affiliations

Institute of Pathology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany; Institute of Pathology, University Medical Center Schleswig-Holstein, Luebeck Site, Luebeck, Germany.

Mildred Scheel School of Oncology, University of Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany; Department of Translational Genomics, University of Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany; Center for Molecular Medicine Cologne, University of Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany.

Publication information

Mol Cell Proteomics. 2022 Sep;21(9):100269. doi: 10.1016/j.mcpro.2022.100269. Epub 2022 Jul 16.

DOI:10.1016/j.mcpro.2022.100269

PMID:35853575

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9450154/

Abstract

Several algorithms for the normalization of proteomic data are currently available, each based on a priori assumptions. Among these is the extent to which differential expression (DE) can be present in the dataset. This factor is usually unknown in explorative biomarker screens. Simultaneously, the increasing depth of proteomic analyses often requires the selection of subsets with a high probability of being DE to obtain meaningful results in downstream bioinformatic analyses. Based on the relationship of technical variation and (true) biological DE of an unknown share of proteins, we propose the "Normics" algorithm: Proteins are ranked based on their expression level-corrected variance and the mean correlation with all other proteins. The latter serves as a novel indicator of the non-DE likelihood of a protein in a given dataset. Subsequent normalization is based on a subset of non-DE proteins only. No a priori information such as batch, clinical, or replicate group is necessary. Simulation data demonstrated robust and superior performance across a wide range of stochastically chosen parameters. Five publicly available spike-in and biologically variant datasets were reliably and quantitatively accurately normalized by Normics, with improved performance compared to standard variance stabilization as well as median, quantile, and LOESS normalizations. In complex biological datasets, Normics correctly determined proteins as being DE that had been cross-validated by an independent transcriptome analysis of the same samples. In both complex datasets, Normics identified the most DE proteins. We demonstrate that combining variance analysis and data-inherent correlation structure to identify non-DE proteins improves data normalization. Standard normalization algorithms can be consolidated against high shares of (one-sided) biological regulation. The statistical power of downstream analyses can be increased by focusing on Normics-selected subsets of high DE likelihood.
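
The ranking-then-normalizing idea described in the abstract can be illustrated with a minimal Python sketch. This is not the published Normics implementation. It assumes a complete (no missing values) proteins-by-samples matrix of log intensities, approximates the "expression level-corrected variance" by the residual of log-variance after a straight-line fit on mean log intensity, uses Pearson correlation, and treats low corrected variance together with high mean correlation as indicating a likely non-DE protein. The function name `normalize_on_stable_subset`, the rank-sum score, and the parameter `n_ref` are hypothetical choices made only for illustration.

```python
import numpy as np

def normalize_on_stable_subset(log_x, n_ref=50):
    """Illustrative sketch only, not the published Normics code.

    log_x : 2-D array, proteins (rows) x samples (columns), log intensities,
            no missing values, every protein with non-zero variance.
    n_ref : hypothetical size of the presumed non-DE reference subset.
    Returns (normalized matrix, reference row indices, per-sample shifts).
    """
    row_mean = log_x.mean(axis=1)
    row_var = log_x.var(axis=1, ddof=1)

    # Assumed level correction: residual of log-variance after a
    # straight-line fit against the mean log intensity.
    slope, intercept = np.polyfit(row_mean, np.log(row_var), 1)
    var_resid = np.log(row_var) - (slope * row_mean + intercept)

    # Mean Pearson correlation of each protein with all other proteins
    # (the diagonal self-correlation of 1 is excluded).
    corr = np.corrcoef(log_x)
    n_prot = corr.shape[0]
    mean_corr = (corr.sum(axis=1) - 1.0) / (n_prot - 1)

    # Rank 0 = lowest corrected variance / highest mean correlation.
    rank_var = var_resid.argsort().argsort()
    rank_corr = (-mean_corr).argsort().argsort()
    score = rank_var + rank_corr  # assumed way of combining both criteria

    # Presumed non-DE proteins: the n_ref best combined ranks.
    ref = np.argsort(score, kind="stable")[:n_ref]

    # Per-sample shift: median deviation of the reference proteins from
    # their own row means, re-centred so overall intensity is preserved.
    centred = log_x[ref] - row_mean[ref][:, None]
    shift = np.median(centred, axis=0)
    shift = shift - shift.mean()

    return log_x - shift[None, :], ref, shift
```

Calling `normalize_on_stable_subset(matrix, n_ref=50)` returns the shifted matrix together with the indices of the proteins used as reference, so the selected subset can be inspected. The size of the reference subset and the way the two rankings are combined are free design choices in this sketch and would need tuning or replacement for real data.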
