Ryu So Young, Qian Wei-Jun, Camp David G, Smith Richard D, Tompkins Ronald G, Davis Ronald W, Xiao Wenzhong
Stanford Genome Technology Center, Stanford University, Stanford, CA 94305, USA; Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352, USA; and Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA.
Bioinformatics. 2014 Oct;30(19):2741-6. doi: 10.1093/bioinformatics/btu341. Epub 2014 Jun 12.
Mass spectrometry (MS)-based high-throughput quantitative proteomics shows great potential in large-scale clinical biomarker studies because it can identify and quantify thousands of proteins in biological samples. However, analyzing quantitative proteomics data poses unique challenges. One issue is that the quantification of a given peptide is often missing in a subset of the experiments, especially for less abundant peptides. Another is that the number of peptides quantified varies substantially among MS experiments in the same study, so an experiment with fewer quantified peptides overall tends to have more missing peptide abundances. To detect as many biomarker proteins as possible, bioinformatics methods that handle these challenges appropriately are needed.
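The two sources of missing values described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the SALPS model: the function names (`simulate_missingness`, `missing_fraction`), the logistic detection probability, the per-experiment "depth" term and all numeric settings are hypothetical choices made for illustration.

```python
import math
import random


def simulate_missingness(n_peptides=200, n_experiments=6, seed=0):
    """Simulate a peptide-by-experiment table with two sources of missing values.

    All distributions and constants are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Hypothetical true log-abundance of each peptide, shared across experiments.
    abundance = [rng.gauss(20.0, 2.0) for _ in range(n_peptides)]
    # Hypothetical per-experiment depth, evenly spaced from -2 (few peptides
    # quantified) to +2 (many peptides quantified).
    depth = [-2.0 + 4.0 * j / (n_experiments - 1) for j in range(n_experiments)]

    table = []
    for a in abundance:
        row = []
        for d in depth:
            # Assumed logistic detection probability: it rises with peptide
            # abundance (mechanism 1) and with experiment depth (mechanism 2).
            p_detect = 1.0 / (1.0 + math.exp(-((a - 20.0) + d)))
            if rng.random() < p_detect:
                row.append(a + rng.gauss(0.0, 0.3))  # observed, with noise
            else:
                row.append(None)  # missing peptide intensity
        table.append(row)
    return abundance, depth, table


def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


if __name__ == "__main__":
    abundance, depth, table = simulate_missingness()
    # Missingness per experiment: shallower experiments lose more peptides.
    for j, d in enumerate(depth):
        column = [row[j] for row in table]
        print(f"experiment {j} (depth {d:+.1f}): {missing_fraction(column):.2f} missing")
    # Missingness by abundance: the less abundant half loses more values.
    order = sorted(range(len(abundance)), key=lambda i: abundance[i])
    half = len(order) // 2
    low = [v for i in order[:half] for v in table[i]]
    high = [v for i in order[half:] for v in table[i]]
    print(f"low-abundance half:  {missing_fraction(low):.2f} missing")
    print(f"high-abundance half: {missing_fraction(high):.2f} missing")
```

In this sketch, missing values are not missing at random: they concentrate in low-abundance peptides and in experiments with low depth, which is the pattern a significance analysis of such data would have to account for.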
We propose a Significance Analysis for Large-scale Proteomics Studies (SALPS) that handles missing peptide intensity values caused by the two mechanisms mentioned above. Our model performs robustly on both simulated data and proteomics data from a large clinical study. Because variation in patient sample quality and drift in instrument performance are unavoidable in clinical studies performed over the course of several years, we believe that our approach will be useful for analyzing large-scale clinical proteomics data.
R code for SALPS is available at http://www.stanford.edu/%7eclairesr/software.html.