Proteome Center Tuebingen, University of Tuebingen, 72076 Tuebingen, Germany.
Mol Cell Proteomics. 2013 Nov;12(11):3420-30. doi: 10.1074/mcp.M113.029165. Epub 2013 Aug 1.
Recent advances in mass spectrometry (MS) have led to increased applications of shotgun proteomics to the refinement of genome annotation. The typical "proteo-genomic" workflows rely on the mapping of peptide MS/MS spectra onto databases derived via six-frame translation of the genome sequence. These databases contain a large proportion of spurious protein sequences, which makes the statistical confidence of the resulting peptide spectrum matches difficult to assess. Here we performed a comprehensive analysis of the Escherichia coli proteome using LTQ-Orbitrap MS and mapped the corresponding MS/MS spectra onto a six-frame translation of the E. coli genome. We hypothesized that the protein-coding part of the E. coli genome approaches complete annotation and that the majority of six-frame-specific (novel) peptide spectrum matches can be considered false positive identifications. We confirm our hypothesis by showing that the posterior error probability distribution of novel hits is almost identical to that of reversed (decoy) hits; this enables us to estimate the sensitivity, specificity, accuracy, and false discovery rate in a typical bacterial proteo-genomic dataset. We use two complementary computational frameworks for processing and statistical assessment of MS/MS data: MaxQuant and Trans-Proteomic Pipeline. We show that MaxQuant achieves a more sensitive six-frame database search with an acceptable false discovery rate and is therefore well suited for global genome reannotation applications, whereas the Trans-Proteomic Pipeline achieves higher specificity and is well suited for high-confidence validation. The use of a small and well-annotated bacterial genome enables us to address genome coverage achieved in state-of-the-art bacterial proteomics: identified peptide sequences mapped to all expressed E. coli proteins but covered 31.7% of the protein-coding genome sequence. Our results show that false discovery rates can be substantially underestimated even in "simple" proteo-genomic experiments obtained by means of high-accuracy MS and point to the necessity of further improvements concerning the coverage of peptide sequences by MS-based methods.
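To make the six-frame translation step concrete, the following is a minimal Python sketch of how a nucleotide sequence can be translated in all six reading frames (three offsets on each strand). It is an illustration only, not the database-construction procedure used in the study: the function names are hypothetical, the standard codon assignments are assumed, and splitting the translations at stop codons into searchable database entries is left out.

```python
from itertools import product

# Standard codon table, built from the conventional TCAG ordering.
# Assumption: standard codon assignments; start-codon handling is ignored.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(genome):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    genome = genome.upper()
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Because every position of the genome is translated in all six frames regardless of annotation, most of the resulting sequence does not correspond to real proteins, which is why such databases make the confidence of peptide spectrum matches harder to assess.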