The Gade Institute, Section for Microbiology and Immunology, University of Bergen, N-5021 Bergen, Norway.
Mol Cell Proteomics. 2011 Jan;10(1):M110.002527. doi: 10.1074/mcp.M110.002527. Epub 2010 Oct 28.
Precise annotation of genes or open reading frames remains difficult, and annotations diverge even when they are generated from the same genomic sequence. This divergence affects downstream proteomic studies and compromises the characterization of clinical isolates, whose many specific genetic variations may not be represented in the selected database. We recently developed software called multistrain mass spectrometry prokaryotic database builder (MSMSpdbb), which merges protein databases from several sources in a proteomics-friendly way and can be applied to any prokaryotic organism. We generated a database for the Mycobacterium tuberculosis complex (using three strains of Mycobacterium bovis and five of M. tuberculosis) and analyzed data collected from two laboratory strains and two clinical isolates of M. tuberculosis. We identified 2561 proteins, of which 24 were present in M. tuberculosis H37Rv samples but not annotated in the M. tuberculosis H37Rv genome. We also identified 280 nonsynonymous single amino acid polymorphisms and confirmed 367 translational start sites. As a proof of concept, we applied the database to whole-genome DNA sequencing data from one of the clinical isolates, which allowed the validation of 116 predicted single amino acid polymorphisms and the annotation of 131 N-terminal start sites. Moreover, we identified regions not present in the original M. tuberculosis H37Rv sequence, indicating either strain divergence or errors in the reference sequence. In conclusion, we demonstrated the potential of a merged database to better characterize laboratory or clinical bacterial strains.
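The idea of merging protein databases from several strains can be illustrated with a minimal sketch. This is not the MSMSpdbb algorithm, whose internals are not described here; it only assumes that identical sequences from different strains can be collapsed into one entry while strain-specific variants are kept as separate entries. The function names (`parse_fasta`, `merge_databases`, `to_fasta`), the strain labels, and the toy sequences are all hypothetical.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_databases(databases: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each distinct sequence to the 'strain|header' labels that carry it.

    Assumption: exact sequence identity is the merge criterion. Variants that
    differ by even one residue stay as separate entries.
    """
    merged: Dict[str, List[str]] = {}
    for strain, fasta_text in databases.items():
        for header, seq in parse_fasta(fasta_text):
            merged.setdefault(seq, []).append(f"{strain}|{header}")
    return merged


def to_fasta(merged: Dict[str, List[str]]) -> str:
    """Write the merged entries back out as FASTA, one entry per sequence."""
    lines: List[str] = []
    for seq, labels in merged.items():
        lines.append(">" + ";".join(labels))
        lines.append(seq)
    return "\n".join(lines)


# Toy input: invented sequences, not real M. tuberculosis proteins.
toy = {
    "strainA": ">p1\nMKTAYIAK\n>p2\nMSLNQ\n",
    "strainB": ">q1\nMKTAYIAK\n>q2\nMSLDQ\n",
}
merged = merge_databases(toy)
print(to_fasta(merged))
```

In this toy case the shared sequence becomes a single entry labeled with both strains, while the two entries that differ by one residue are both retained, so a peptide spanning that position could be matched in either form during a database search.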