Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA.
Mol Cell Proteomics. 2012 Oct;11(10):933-44. doi: 10.1074/mcp.M112.019471. Epub 2012 Jul 5.
Peptide sequencing by computational assignment of tandem mass spectra to a database of putative protein sequences provides an independent approach to confirming or refuting protein predictions based on large-scale DNA and RNA sequencing efforts. This use of mass spectrometry-derived sequence data for testing and refining predicted gene models has been termed proteogenomics. We report herein the application of proteogenomic methodology to a database of 10.9 million tandem mass spectra collected over a period of two years from proteolytically generated peptides isolated from the model legume Medicago truncatula. These spectra were searched against a database of predicted M. truncatula protein sequences generated from public databases, in silico gene model predictions, and a whole-genome six-frame translation. This search identified 78,647 distinct peptide sequences. A comparison with the publicly available proteome from the recently published M. truncatula genome supported translation of 9,843 existing gene models and identified 1,568 novel peptides suggesting corrections or additions to the current annotations. Each supporting and novel peptide was independently validated using mRNA-derived deep-sequencing coverage, and an overall correlation of 93% between the two data types was observed. We have additionally highlighted examples of several aspects of structural annotation for which tandem MS provides unique evidence not easily obtainable through typical DNA or RNA sequencing. Proteogenomic analysis is a valuable and unique source of information for the structural annotation of genomes and should be included in such efforts to ensure that the genome models used by biologists mirror as accurately as possible what is present in the cell.
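The six-frame translation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard genetic code (NCBI translation table 1) and translates a DNA string in the three forward and three reverse-complement reading frames. The function names (`reverse_complement`, `translate`, `six_frame_translation`) and the frame labels are hypothetical and chosen for illustration only; the abstract does not describe the software actually used to build the search database.

```python
# Illustrative sketch only: assumes the standard genetic code (NCBI table 1).
# All names here are hypothetical and are not taken from the paper.

BASES = "TCAG"
# Amino acids in the conventional TCAG codon order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate consecutive codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate a sequence in all six reading frames.

    Keys are '+1', '+2', '+3' for the forward strand and
    '-1', '-2', '-3' for the reverse-complement strand.
    """
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

In a proteogenomic search, translations like these are typically split at stop codons into candidate protein segments, so that spectra can match peptides from regions that no existing gene model covers. That splitting step is omitted here for brevity.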