Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, California 92092;
Mol Cell Proteomics. 2014 Jan;13(1):157-67. doi: 10.1074/mcp.M113.031260. Epub 2013 Oct 18.
New technologies in genomics and proteomics have influenced the emergence of proteogenomics, a field at the confluence of genomics, transcriptomics, and proteomics. First-generation proteogenomic toolkits employ peptide mass spectrometry to identify novel protein-coding regions. We extend first-generation proteogenomic tools to achieve greater accuracy and enable the analysis of large, complex genomes. We apply our pipeline to Zea mays, which has a genome comparable in size to the human genome. Our pipeline begins with the comparison of mass spectra to a putative translation of the genome. We select novel peptides, those that match a region of the genome that was not previously known to be protein coding, for grouping into refinement events. We present a novel, probabilistic framework for evaluating the accuracy of each event. Our calculated event probability, or eventProb, considers the number of supporting peptides and spectra, and the quality of each supporting peptide-spectrum match. Our pipeline predicts 165 novel protein-coding genes and proposes updated models for 741 additional genes.
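The abstract does not say how the putative translation of the genome is produced. One common approach in proteogenomics is a six-frame translation, sketched below in Python purely as an illustration. The function names (`six_frame_translation`, `find_peptide`) are hypothetical and are not taken from the paper's pipeline. The sketch also matches a peptide string directly against the translated frames, whereas a real pipeline would score mass spectra against candidate peptides.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map (strand, offset) to the protein string for each of the six frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def find_peptide(peptide, frames):
    """Hypothetical helper: list the frames whose translation contains the peptide."""
    return [key for key, protein in frames.items() if peptide in protein]


if __name__ == "__main__":
    frames = six_frame_translation("ATGGCCAAGTAA")
    print(frames[("+", 0)])             # MAK*
    print(find_peptide("MAK", frames))  # [('+', 0)]
```

In this sketch, a peptide found in a frame that does not overlap any annotated coding region would be a candidate "novel peptide" in the abstract's sense. The annotation lookup itself is omitted here.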