Faculty of Life Sciences, The University of Manchester, Manchester M13 9PT, UK.
J Proteome Res. 2012 Nov 2;11(11):5221-34. doi: 10.1021/pr300411q. Epub 2012 Oct 15.
Proteogenomics has the potential to advance genome annotation through high-quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function by uncovering novel genes and gene structures that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for both groups of peptide spectrum matches (PSMs) and individual PSMs. These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates, leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. This error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five "incorrect" targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives.
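For readers unfamiliar with the target:decoy estimates discussed above, the sketch below (not the authors' pipeline; the PSM scores are hypothetical) shows the common concatenated target:decoy calculation in which the FDR at a score threshold is estimated as the number of decoy PSMs divided by the number of target PSMs at or above that threshold, with q-values taken as the cumulative minimum of the running FDR.

```python
# Minimal target:decoy FDR/q-value sketch. This uses the simple
# decoys/targets estimator; published tools may apply corrections
# (e.g. +1 smoothing or decoy-count scaling).

def target_decoy_fdr(psms):
    """psms: list of (score, is_decoy) tuples.
    Returns a list of (score, is_decoy, q_value), best score first."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    running_fdr = []
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        running_fdr.append(decoys / max(targets, 1))
    # q-value: the minimum FDR at which a PSM would still be accepted,
    # i.e. the cumulative minimum taken from the bottom of the list up.
    qvals = [0.0] * len(running_fdr)
    q = float("inf")
    for i in range(len(running_fdr) - 1, -1, -1):
        q = min(q, running_fdr[i])
        qvals[i] = q
    return [(s, d, qv) for (s, d), qv in zip(ranked, qvals)]


if __name__ == "__main__":
    # Hypothetical PSM scores; decoy matches are flagged True.
    example = [(95.2, False), (90.1, False), (88.7, True),
               (85.0, False), (60.3, True), (55.9, False)]
    for score, is_decoy, qval in target_decoy_fdr(example):
        label = "decoy" if is_decoy else "target"
        print(f"{score:6.1f}  {label:6s}  q={qval:.3f}")
```

Because the estimate depends directly on the target and decoy score distributions, changing the composition of the search database (for example, using an inflated six-frame translation) changes the FDR and q-values assigned to the same spectra at a fixed threshold, which is the effect the abstract describes.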
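To illustrate the six-fold inflation referred to above, the following sketch (assuming Biopython is available; the transcript fragment is hypothetical) enumerates the six conceptual translations of a nucleotide sequence: three forward frames and three reverse-complement frames, of which at most one is the genuine coding frame.

```python
# Minimal six-frame translation sketch using Biopython's Seq class.
from Bio.Seq import Seq

def six_frame_translation(nucleotide_seq):
    """Return the six conceptual protein translations of a nucleotide sequence."""
    seq = Seq(nucleotide_seq)
    frames = []
    for strand in (seq, seq.reverse_complement()):
        for offset in (0, 1, 2):
            # Trim to a whole number of codons so translate() sees complete triplets.
            sub = strand[offset:]
            sub = sub[: len(sub) - len(sub) % 3]
            frames.append(str(sub.translate()))
    return frames


if __name__ == "__main__":
    # Hypothetical transcript fragment: one sequence yields six database entries.
    for i, protein in enumerate(
            six_frame_translation("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"), 1):
        print(f"frame {i}: {protein}")
```

Every transcript therefore contributes six entries to the target database, so for each genuine coding frame there are five additional "incorrect" target sequences, which is the database inflation that the abstract identifies as the source of the shifted FDR and PEP estimates.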