Nestlé Institute of Food Safety & Analytical Sciences, Nestlé Research, EPFL Innovation Park, 1015, Lausanne, Switzerland.
Nestlé Institute of Food Safety & Analytical Sciences, Nestlé Research, EPFL Innovation Park, 1015, Lausanne, Switzerland; Chemistry and Chemical Engineering Section, School of Basic Sciences, Ecole Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland.
Biosystems. 2022 Jun;215-216:104661. doi: 10.1016/j.biosystems.2022.104661. Epub 2022 Mar 2.
Large-scale proteomic studies have to deal with unwanted variability, especially when samples originate from different centers and multiple analytical batches are needed. Such variability is typically introduced at every step of a clinical research study, from the collection and storage of human biological samples, through sample preparation and spectral data acquisition, to peptide and protein quantification. To remove this diverse and unwanted variability, the protein data are normalized. Several reviews comparing normalization methods in the -omics field have already been published, but far fewer reports focus on proteomic data generated by mass spectrometry (MS). Moreover, most of these reports have dealt only with small datasets.
Here, as a case study, we focused on the normalization of a large MS-based proteomic dataset obtained from an overweight and obese pan-European cohort, evaluating different normalization methods, namely: center standardize, quantile protein, quantile sample, global standardization, ComBat, median centering, mean centering, single standard, and removal of unwanted variation (RUV); some of these are generic normalization methods, while others were created specifically to deal with genomic or metabolomic data. We assessed how relationships between proteins and clinical variables (e.g., gender, triglyceride or cholesterol levels) improved after normalizing the data with each method.
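For illustration only, the following is a minimal sketch of two of the evaluated approaches, median centering and quantile sample normalization, applied to a samples × proteins matrix of log-transformed intensities; the function names and the toy data are assumptions made for this example, not code from the study.

```python
import numpy as np

def median_center(X):
    # Median centering: subtract each sample's median log-intensity so that
    # all samples end up with the same (zero) median.
    return X - np.median(X, axis=1, keepdims=True)

def quantile_normalize_samples(X):
    # Quantile (sample) normalization: map every sample's intensity
    # distribution onto the mean distribution computed across samples.
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)   # within-sample ranks
    reference = np.sort(X, axis=1).mean(axis=0)         # mean sorted distribution
    return reference[ranks]

# Toy data: 4 samples (rows) x 6 proteins (columns) of log2 intensities,
# with an artificial per-sample offset mimicking a batch/loading effect.
rng = np.random.default_rng(42)
X = rng.normal(loc=20.0, scale=2.0, size=(4, 6)) + np.array([[0.0], [1.5], [-1.0], [0.5]])
print(np.round(median_center(X), 2))
print(np.round(quantile_normalize_samples(X), 2))
```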
Some normalization methods were better suited to this particular large-scale shotgun proteomic dataset of human plasma samples labeled with isobaric tags and analyzed by liquid chromatography-tandem MS. In particular, quantile sample normalization, RUV, and mean and median centering performed very well, whereas quantile protein normalization gave worse results than the unnormalized data.
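As a rough illustration of how one might quantify whether a normalization step strengthens protein–clinical-variable relationships (this is not necessarily the paper's exact evaluation metric), the sketch below scores a data matrix by the mean absolute Spearman correlation between each protein and a clinical variable; X_raw, X_norm, and triglycerides are hypothetical inputs.

```python
import numpy as np
from scipy.stats import spearmanr

def association_strength(X, clinical):
    # Mean absolute Spearman correlation between each protein (column of X)
    # and a clinical variable across samples; a higher score is taken here
    # as a crude proxy for better-preserved biological signal.
    rhos = [abs(spearmanr(X[:, j], clinical).correlation) for j in range(X.shape[1])]
    return float(np.mean(rhos))

# Hypothetical usage: compare a raw matrix against its normalized version.
# score_raw  = association_strength(X_raw,  triglycerides)
# score_norm = association_strength(X_norm, triglycerides)
# A larger score_norm than score_raw would suggest the normalization helped.
```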