Rudnick Paul A, Wang Xia, Yan Xinjian, Sedransk Nell, Stein Stephen E
Mass Spectrometry Data Center, National Institute of Standards and Technology, Gaithersburg, Maryland.
Mol Cell Proteomics. 2014 May;13(5):1341-51. doi: 10.1074/mcp.M113.030593. Epub 2014 Feb 21.
Normalization is an important step in the analysis of quantitative proteomics data. If this step is ignored, systematic biases can lead to incorrect conclusions about regulation. Most statistical procedures for normalizing proteomics data have been borrowed from genomics, where their development has focused on the removal of so-called "batch effects." In general, a typical normalization step in proteomics works under the assumption that most peptides/proteins do not change; scaling is then used to give a median log-ratio of 0. The focus of this work was to identify other factors, derived from knowledge of the variables in proteomics, that might be used to improve normalization. Here we have examined the multi-laboratory data sets from Phase I of the NCI's CPTAC program. Surprisingly, the most important bias variables affecting peptide intensities within labs were retention time and charge state. The magnitude of these effects was exaggerated in samples of unequal concentrations or "spike-in" levels, presumably because the average precursor charge for peptides with higher charge-state potentials is lower at higher relative sample concentrations. These effects are consistent with reduced protonation during electrospray and demonstrate that the physical properties of the peptides themselves can serve as good reporters of systematic biases. Between labs, retention time, precursor m/z, and peptide length were most commonly the top-ranked bias variables, ahead of the commonly used average intensity (A). A larger set of variables was then used to develop a stepwise normalization procedure. This statistical model was found to perform as well as, or better than, other commonly used methods on the CPTAC mock biomarker data. Furthermore, the method described here does not require a priori knowledge of the systematic biases in a given data set. These improvements can be attributed to the inclusion of variables other than average intensity during normalization.
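The sketch below illustrates, in Python, the two ideas contrasted in the abstract: the standard global scaling that shifts each run's log2 intensities so the median log-ratio to a reference run is 0, and a simple stepwise correction that regresses residual log intensity on candidate bias variables such as retention time and charge state. The column names (run, peptide, intensity, rt, charge), the greedy correlation-based variable ranking, and the linear least-squares fits are assumptions made for illustration only; they are not the authors' exact model.

import numpy as np
import pandas as pd

def global_median_scale(df, ref_run):
    # Standard approach: assume most peptides do not change and shift each
    # run's log2 intensities so its median log-ratio to a reference run is 0.
    # Assumes one row per (run, peptide) pair.
    out = df.copy()
    out["log2_int"] = np.log2(out["intensity"])
    ref = out.loc[out["run"] == ref_run].set_index("peptide")["log2_int"]
    for run in out["run"].unique():
        mask = out["run"] == run
        # Alignment on the peptide index gives NaN for unshared peptides;
        # median() skips those by default.
        ratios = out.loc[mask].set_index("peptide")["log2_int"] - ref
        out.loc[mask, "log2_int"] -= ratios.median()
    return out

def stepwise_bias_correction(df, variables=("rt", "charge")):
    # Illustrative stepwise normalization: per run, repeatedly pick the bias
    # variable most correlated with the residual log intensity, fit a linear
    # trend by least squares, and subtract it.
    out = df.copy()
    for run, grp in out.groupby("run"):
        center = grp["log2_int"].mean()
        resid = (grp["log2_int"] - center).to_numpy()
        remaining = list(variables)
        while remaining:
            best = max(remaining,
                       key=lambda v: abs(np.corrcoef(grp[v], resid)[0, 1]))
            slope, intercept = np.polyfit(grp[best].to_numpy(), resid, 1)
            resid = resid - (slope * grp[best].to_numpy() + intercept)
            remaining.remove(best)
        out.loc[grp.index, "log2_int"] = resid + center
    return out

In the published procedure the candidate set of bias variables is larger (including, e.g., precursor m/z and peptide length), and the fitted trends need not be linear; a smoother such as LOESS would be a natural substitute for the np.polyfit call in this sketch.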