Chen Lin S, Wang Jiebiao, Wang Xianlong, Wang Pei
Department of Public Health Sciences, University of Chicago, 5841 S Maryland Ave, Chicago, Illinois, USA.
Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, Washington 98109, USA.
Ann Appl Stat. 2017 Mar;11(1):114-138. doi: 10.1214/16-AOAS994. Epub 2017 Apr 8.
In mass spectrometry (MS)-based quantitative proteomics research, the emerging iTRAQ (isobaric tag for relative and absolute quantitation) and TMT (tandem mass tags) techniques have been widely adopted for high-throughput protein profiling. In a typical iTRAQ/TMT proteomics study, samples are grouped into batches, and each batch is processed by one multiplex experiment, in which the abundances of thousands of proteins/peptides in a batch of samples can be measured simultaneously. The multiplex labeling technique greatly enhances the throughput of protein quantification. However, the technical variation across different iTRAQ/TMT multiplex experiments is often large due to the dynamic nature of MS instruments. This leads to strong batch effects in the iTRAQ/TMT data. Moreover, iTRAQ/TMT data often contain substantial batch-level nonignorable missing entries. Specifically, the abundance measures of a given protein/peptide are often either observed in all the samples from the same batch or missing in all of them, with the missing probability depending on the combined batch-level abundances. We term this unique missing-data mechanism the Batch-level Abundance-Dependent Missing-data Mechanism (BADMM). We introduce a new method, mixEMM, for analyzing iTRAQ/TMT data with batch effects and batch-level nonignorable missingness. The mixEMM method employs a linear mixed-effects model and explicitly models the batch effects and the BADMM. With simulation studies, we showed that, compared with existing approaches that utilize relative abundances and ignore the missing batches under the missing-completely-at-random assumption, the mixEMM method achieves more accurate parameter estimation and inference. We applied the method to an iTRAQ proteomics data set from a breast cancer study and identified phosphopeptides differentially expressed between different breast cancer subtypes. The method can be applied to general clustered data with cluster-level nonignorable missing-data mechanisms.
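To illustrate why ignoring missing batches can mislead, the following minimal Python sketch simulates one peptide under a batch-level abundance-dependent missing-data mechanism. It is not the mixEMM implementation. The function name `simulate_badmm`, the logistic form of the missing probability, the assumption that lower-abundance batches are more likely to be missing, and all parameter values (`gamma0`, `gamma1`, the variances) are hypothetical choices made for illustration only.

```python
import math
import random


def simulate_badmm(n_batches=200, batch_size=4, sigma_batch=1.0,
                   sigma_noise=0.5, gamma0=0.0, gamma1=2.0, seed=0):
    """Simulate log-abundances of one peptide with batch-level missingness.

    Hypothetical setup: each batch has a random batch effect, and the whole
    batch is either observed or missing, with a missing probability that
    depends on the batch mean abundance through an assumed logistic link.
    """
    rng = random.Random(seed)
    batches = []
    for _ in range(n_batches):
        batch_effect = rng.gauss(0.0, sigma_batch)  # random batch effect
        values = [batch_effect + rng.gauss(0.0, sigma_noise)
                  for _ in range(batch_size)]
        batch_mean = sum(values) / batch_size
        # Assumption: lower batch-level abundance -> higher missing probability.
        p_missing = 1.0 / (1.0 + math.exp(gamma0 + gamma1 * batch_mean))
        missing = rng.random() < p_missing  # one draw for the whole batch
        batches.append({"values": values, "mean": batch_mean,
                        "missing": missing})
    return batches


def average(xs):
    return sum(xs) / len(xs)


batches = simulate_badmm()
observed = [b for b in batches if not b["missing"]]

# Mean over all batches (unavailable in practice) versus the naive mean
# computed from observed batches only.
true_mean = average([b["mean"] for b in batches])
naive_mean = average([b["mean"] for b in observed])

print(f"observed batches: {len(observed)} of {len(batches)}")
print(f"mean over all batches:      {true_mean:.3f}")
print(f"mean over observed batches: {naive_mean:.3f}")
```

Under these assumed settings the observed batches are a selected, higher-abundance subset, so the naive mean overstates the overall abundance. This is the kind of distortion that arises when missing batches are dropped as if they were missing completely at random.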