CALIPHO Group, SIB Swiss Institute of Bioinformatics, CMU, Rue Michel-Servet 1, CH-1211 Geneva, Switzerland.
Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, CMU, Rue Michel-Servet 1, CH-1211 Geneva, Switzerland.
J Proteome Res. 2018 Dec 7;17(12):4160-4170. doi: 10.1021/acs.jproteome.8b00392. Epub 2018 Sep 17.
The practice of data sharing in the proteomics field took off and quickly spread in recent years as a result of collective effort. Nowadays, most journal editors mandate the submission of the original raw mass spectra to one of the databases of the ProteomeXchange consortium. With the exception of large institutional initiatives such as PeptideAtlas or the GPMDB, however, few new studies are based on the reanalysis of mass spectrometry data. A wealth of information is thus left unexploited in public databases and repositories. Here, we present the large-scale reanalysis, using a custom workflow, of 41 publicly available data sets corresponding to experiments carried out on the HeLa cancer cell line. In addition to the search for new post-translational modification sites and "missing proteins", our main goal is to identify single amino acid variants and evaluate their impact on protein expression and stability through the spectral counting quantification approach. The X!Tandem software was selected to search a total of 56 363 701 tandem mass spectra against a customized variant protein database, compiled by applying the in-house MzVar tool to HeLa-specific somatic and genomic variants retrieved from the COSMIC cell line project. After filtering the resulting identifications with a 1% FDR threshold computed at the protein level, 49 466 unique peptides were identified in 7266 protein entries, allowing the validation of 5576 protein entries in accordance with the HPP guidelines version 2.1. A new "missing protein" was observed (FRAT2, NX_O75474, chromosome 10), and 189 new phosphorylation sites and 392 new protein N-terminal acetylation sites could be identified. Twenty-four variant peptides were also identified, corresponding to 21 variants in 21 proteins. For three of the nine heterozygous cases where both the variant peptide and its wild-type counterpart were detected, the application of a two-tailed sign test showed a significant difference in the abundance of the two peptide versions.
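The two-tailed sign test mentioned above can be illustrated with a minimal sketch. It assumes that, for one heterozygous variant, spectral counts for the variant peptide and its wild-type counterpart are paired per data set; the abstract does not state how the pairs were formed, so this pairing, the function name `two_tailed_sign_test`, and the example counts are hypothetical and not taken from the study.

```python
from math import comb


def two_tailed_sign_test(variant_counts, wildtype_counts):
    """Exact two-tailed sign test on paired spectral counts.

    Each pair is assumed (hypothetically) to come from one data set.
    Ties are discarded, as is conventional for the sign test.
    Returns the p-value under H0: either peptide version is equally
    likely to have the higher count.
    """
    if len(variant_counts) != len(wildtype_counts):
        raise ValueError("count lists must have the same length")

    positives = sum(v > w for v, w in zip(variant_counts, wildtype_counts))
    negatives = sum(v < w for v, w in zip(variant_counts, wildtype_counts))
    n = positives + negatives  # informative (non-tied) pairs
    if n == 0:
        return 1.0  # no evidence either way

    # Probability of a split at least as extreme as the observed one
    # in a single tail of Binomial(n, 0.5), doubled for two tails.
    k = min(positives, negatives)
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)


# Illustrative, made-up spectral counts across six data sets.
variant = [5, 3, 8, 2, 7, 4]
wildtype = [1, 1, 2, 2, 3, 1]
p_value = two_tailed_sign_test(variant, wildtype)
print(f"p = {p_value:.4f}")
```

Because the sign test uses only the direction of each paired difference, it makes no assumption about the distribution of spectral counts, which may be one reason to prefer it for this kind of data; the price is low power when few data sets contain both peptide versions.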