Department of Proteomics, UFZ, Helmholtz-Centre for Environmental Research Leipzig, 04318 Leipzig, Germany.
J Proteomics. 2013 Jun 28;86:27-42. doi: 10.1016/j.jprot.2013.04.036. Epub 2013 May 9.
Correct annotation of protein-coding genes is the basis of conventional data analysis in proteomic studies. Nevertheless, most protein sequence databases rely almost exclusively on gene-finding software and inevitably miss some protein annotations or contain errors. Proteogenomics tries to overcome these issues by matching MS data directly against a genome sequence database. Here we report an in-depth proteogenomic study of Helicobacter pylori strain 26695. MS data were searched against a combined database of the NCBI annotations and a six-frame translation of the genome. Database searches with Mascot and X! Tandem identified 1115 proteins, each supported by at least two peptides, at a peptide false discovery rate below 1%. This represents 71% of the predicted proteome and is, so far, the most extensive proteome study of Helicobacter pylori. Our proteogenomic approach unambiguously identified four previously missed annotations and allowed us to correct the sequences of six annotated proteins. Since secreted proteins are often involved in pathogenic processes, we further investigated signal peptidase cleavage sites. By applying a database search that accommodates semi-specifically cleaved peptides, we detected 63 previously unknown signal peptides. The motif LXA proved to be the predominant recognition sequence for signal peptidases.
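To illustrate what a six-frame translation of a genome involves, the following is a minimal Python sketch. It is not the pipeline used in the study. The function names (`reverse_complement`, `translate`, `six_frame_translation`, `stop_free_segments`) and the `min_len` threshold are assumptions made for illustration. The codon assignments are those of the standard genetic code, which the bacterial translation table shares.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(genome):
    """Translate both strands in all three reading frames.

    Keys are '+1', '+2', '+3' for the forward strand and
    '-1', '-2', '-3' for the reverse-complement strand.
    """
    genome = genome.upper()
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def stop_free_segments(protein, min_len=6):
    """Split a frame translation at stop codons ('*') and keep longer segments.

    min_len is an arbitrary illustrative cutoff, not a value from the study.
    """
    return [segment for segment in protein.split("*") if len(segment) >= min_len]


if __name__ == "__main__":
    frames = six_frame_translation("ATGGCC")
    for name, protein in frames.items():
        print(name, protein)
```

In a proteogenomic search, the stop-free segments from all six frames would be written out as database entries and appended to the annotated protein sequences, so that spectra can match peptides lying outside the annotated gene boundaries.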
The results of MS-based proteomic studies rely heavily on the correct annotation of protein-coding genes, which is the basis of conventional data analysis. However, the annotation of protein-coding sequences in genomic data is usually based on gene-finding software. These tools are limited in their prediction accuracy; for example, they often struggle to determine exact gene boundaries. As a result, protein databases contain some erroneous or incomplete sequences, and some protein sequences may be missing altogether. Proteogenomics, a combination of proteomic and genomic data analyses, is well suited to detect previously unannotated proteins and to correct erroneous sequences. For this purpose, the existing database of the investigated species is typically supplemented with a six-frame translation of the genome. Here, we studied the proteome of the major human pathogen Helicobacter pylori, which is responsible for many gastric diseases such as duodenal ulcers and gastric cancer. Our in-depth proteomic study reliably identified 1115 proteins (FDR<0.01%), each by at least two peptides (FDR<1%), which represent 71% of the predicted proteome deposited at NCBI. The proteogenomic analysis of our data set resulted in the unambiguous identification of four previously missed annotations, the correction of six annotated proteins and the detection of 63 previously unknown signal peptides. We have annotated proteins of particular biological interest, such as the ferrous iron transport protein A, the coiled-coil-rich protein HP0058 and the lipopolysaccharide biosynthesis protein HP0619. For instance, the protein HP0619 could be a drug target for the inhibition of the LPS synthesis pathway. Furthermore, the motif "LXA" was shown to be the predominant recognition sequence for the signal peptidase I of H. pylori. Signal peptidases are essential for the viability of bacterial cells and are involved in pathogenesis. Therefore, signal peptidases could be novel targets for antibiotics. The inclusion of the corrected and newly annotated proteins, together with the information on signal peptide cleavage sites, will help in the study of biological pathways involved in pathogenesis or drug response of H. pylori.
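As a rough illustration of how a recognition motif such as "LXA" could be tallied from observed cleavage sites, consider the Python sketch below. It rests on several assumptions that are not stated in the text: that the motif occupies the three residues immediately N-terminal of the cleavage site (positions -3 to -1), that each cleavage site is given as the 0-based index of the first residue of the mature protein (for example, the start of a semi-specifically cleaved peptide), and that the example sequences are invented placeholders rather than H. pylori proteins. The function names are hypothetical.

```python
from collections import Counter


def motif_before_site(protein, site):
    """Return the three residues directly N-terminal of a cleavage site.

    `site` is the 0-based index of the first residue of the mature protein,
    so the returned residues sit at positions -3, -2 and -1.
    """
    if site < 3 or site > len(protein):
        raise ValueError("cleavage site must leave three residues upstream")
    return protein[site - 3:site]


def generalize(motif):
    """Mask the middle residue, e.g. 'LQA' becomes 'LXA'."""
    return motif[0] + "X" + motif[2]


def tally_motifs(observations):
    """Count generalized motifs over (protein sequence, cleavage site) pairs."""
    return Counter(
        generalize(motif_before_site(protein, site))
        for protein, site in observations
    )


if __name__ == "__main__":
    # Invented placeholder sequences, NOT real H. pylori proteins.
    observations = [
        ("MKKIFLSLALSLQADEPK", 14),  # upstream residues: LQA
        ("MRKLALFSALNAQTPE", 12),    # upstream residues: LNA
        ("MKRSIVAFAVSAHAETTK", 14),  # upstream residues: AHA
    ]
    for motif, count in tally_motifs(observations).most_common():
        print(motif, count)
```

With real data, the cleavage sites would come from peptides whose N-terminus does not follow a protease cleavage site but lies shortly after the protein's annotated start, which is why a semi-specific search is needed to find them.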