Institute of Bioinformatics, International Technology Park, Bangalore, India.
Mol Cell Proteomics. 2011 Dec;10(12):M111.011627. doi: 10.1074/mcp.M111.011445. Epub 2011 Oct 3.
Genome sequencing of the H37Rv strain of Mycobacterium tuberculosis was completed in 1998, followed by whole genome sequencing of a clinical isolate, CDC1551, in 2002. Since then, the genomic sequences of a number of other strains have become available, making it one of the better studied pathogenic bacterial species at the genomic level. However, annotation of its genome remains challenging because of its high GC content and dissimilarity to other model prokaryotes. To this end, we carried out an in-depth proteogenomic analysis of the M. tuberculosis H37Rv strain using Fourier transform mass spectrometry with high resolution at both MS and tandem MS levels. In all, we identified 3176 proteins from M. tuberculosis, representing ~80% of its total predicted gene count. In addition to a protein database search, we carried out a genome database search, which led to the identification of ~250 novel peptides. Based on these novel genome search-specific peptides, we discovered 41 novel protein-coding genes in the H37Rv genome. Using peptide evidence and alternative gene prediction tools, we also corrected 79 gene models. Finally, mass spectrometric data from N terminus-derived peptides confirmed 727 existing annotations for translational start sites while correcting those for 33 proteins. We report, for the first time, a high-confidence set of protein-coding regions in the M. tuberculosis genome obtained by tandem mass spectrometry with high resolution at both the precursor and fragment detection steps. This proteogenomic approach should be generally applicable to other organisms whose genomes have already been sequenced, yielding a more accurate catalogue of protein-coding genes.
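The abstract does not state how the genome database was built. One common way to run a genome database search in proteogenomics is to translate the genome in all six reading frames and look up identified peptides in those translations. The following is a minimal Python sketch under that assumption; the function names (`six_frame_translations`, `find_peptide`) and the toy sequence are illustrative and are not taken from the paper or from the H37Rv genome.

```python
# Standard genetic code (NCBI translation table 1), laid out in TCAG order.
# Bacterial table 11 assigns the same amino acids to every codon and differs
# only in which codons may act as start codons, so it is not modelled here.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(genome):
    """Map (strand, frame offset) to the translated sequence for all six frames."""
    genome = genome.upper()
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def find_peptide(peptide, genome):
    """Return (strand, frame offset, 0-based forward-strand start) for each hit."""
    hits = []
    n = len(genome)
    for (strand, offset), protein in six_frame_translations(genome).items():
        pos = protein.find(peptide)
        while pos != -1:
            start = offset + 3 * pos  # position on the strand that was translated
            if strand == "-":
                # Convert to the leftmost coordinate on the forward strand.
                start = n - (start + 3 * len(peptide))
            hits.append((strand, offset, start))
            pos = protein.find(peptide, pos + 1)
    return hits
```

For the toy sequence `ATGGCCAAGTGA`, `find_peptide("MAK", "ATGGCCAAGTGA")` reports a single forward-strand hit at position 0. In a real analysis, peptides that match only the six-frame translation and not the annotated protein database would be the candidates for novel genes or corrected gene models; exact string matching here stands in for the spectrum-level search a proteomics search engine performs.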