Department of Public Health Sciences, University of Chicago, 5841 S. Maryland Ave., Chicago, IL, USA.
Department of Genetics and Genomics Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, 770 Lexington Avenue, New York, NY, USA.
Biostatistics. 2019 Oct 1;20(4):648-665. doi: 10.1093/biostatistics/kxy022.
In quantitative proteomics, mass tag labeling techniques have been widely adopted in mass spectrometry experiments. These techniques allow peptides (short amino acid sequences) and proteins from the multiple samples in a batch to be detected and quantified in a single experiment, and thus greatly improve the efficiency of protein profiling. However, batch processing of samples also results in severe batch effects and non-ignorable missing data occurring at the batch level. Motivated by the breast cancer proteomic data from the Clinical Proteomic Tumor Analysis Consortium, in this work we developed two tailored multivariate MIxed-effects SElection models (mvMISE) to jointly analyze multiple correlated peptides/proteins in labeled proteomics data while accounting for the batch effects and the non-ignorable missingness. By taking a multivariate approach, we can borrow information across multiple peptides of the same protein or multiple proteins from the same biological pathway, and thus achieve better statistical efficiency and biological interpretation. The two models account for different correlation structures among a group of peptides or proteins. Specifically, to model multiple peptides from the same protein, we employed a factor-analytic random effects structure to characterize the high and similar correlations among peptides. To model biological dependence among multiple proteins in a functional pathway, we introduced a graphical lasso penalty on the error precision matrix, and implemented an efficient algorithm based on the alternating direction method of multipliers. Simulations demonstrated the advantages of the proposed models. Applying the proposed methods to the motivating data set, we identified phosphoproteins and biological pathways that showed different activity patterns in triple-negative breast tumors versus other breast tumors.
The proposed methods can also be applied to other high-dimensional multivariate analyses based on clustered data with or without non-ignorable missingness.
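To make the idea of batch-level non-ignorable missingness concrete, the following is a minimal simulation sketch for a single peptide. It is not the mvMISE implementation. The logistic form of the missingness model, the function name `simulate_batches`, and all parameter names and values (`phi0`, `phi1`, `sd_batch`, and so on) are assumptions chosen for illustration only. The sketch pairs a mixed-effects outcome model (a batch random effect plus noise) with a missingness model in which a whole batch is dropped with a probability that depends on the batch's mean abundance.

```python
import math
import random


def simulate_batches(n_batches=5000, batch_size=4, mu=0.0, sd_batch=1.0,
                     sd_error=0.5, phi0=0.0, phi1=-2.0, seed=1):
    """Simulate abundances of one peptide measured in batches.

    Outcome model (mixed effects): y_ij = mu + b_i + e_ij, where b_i is a
    random effect shared by all samples in batch i.
    Missingness model (assumed logistic form, hypothetical parameters):
    the whole batch is missing with probability
    expit(phi0 + phi1 * batch_mean). With phi1 < 0, low-abundance batches
    are more likely to be missing; with phi1 = 0, missingness is ignorable.
    """
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n_batches):
        b = rng.gauss(0.0, sd_batch)  # batch effect
        y = [mu + b + rng.gauss(0.0, sd_error) for _ in range(batch_size)]
        batch_mean = sum(y) / batch_size
        p_missing = 1.0 / (1.0 + math.exp(-(phi0 + phi1 * batch_mean)))
        full.extend(y)
        if rng.random() >= p_missing:  # batch is observed
            observed.extend(y)
    return full, observed


def mean(values):
    return sum(values) / len(values)


# Non-ignorable case: missingness depends on the unobserved abundance.
full, observed = simulate_batches(phi1=-2.0)
bias_nonignorable = mean(observed) - mean(full)

# Ignorable control: batches go missing completely at random.
full_mcar, observed_mcar = simulate_batches(phi1=0.0)
bias_ignorable = mean(observed_mcar) - mean(full_mcar)

print(f"fraction of values observed: {len(observed) / len(full):.2f}")
print(f"bias of observed mean, non-ignorable: {bias_nonignorable:.3f}")
print(f"bias of observed mean, ignorable:     {bias_ignorable:.3f}")
```

Under these assumed settings, the mean of the observed values overestimates the true mean when missingness depends on abundance, while the ignorable control shows little bias. This is the kind of distortion that a selection model aims to correct by modeling the outcome and the missingness mechanism jointly.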