Kumar Dhirendra, Jain Aradhya, Dash Debasis
G. N. Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Mathura Road, Delhi 110025, India.
J Proteome Res. 2015 Dec 4;14(12):4949-58. doi: 10.1021/acs.jproteome.5b00728. Epub 2015 Oct 5.
The missing human proteome comprises predicted protein-coding genes with no credible protein-level evidence detected so far and constitutes ~18% of human protein-coding genes (neXtProt release of 19 September 2014). Missing proteins may be of pharmacological interest because many of them are membrane receptors, and they therefore require comprehensive characterization. In the present study, we explored how various computational parameters that are crucial when searching tandem mass spectrometry (MS) data for proteins affect missing protein identification. The variables considered were differences in search database composition, shared peptides, semitryptic searches, post-translational modifications (PTMs), and transcriptome-guided proteogenomic searches. We used a multialgorithmic approach to detect proteins in publicly available mass spectra from recent studies covering diverse human tissues and cell types. Using these approaches, we detected 24 missing proteins (22 PE2, 1 PE4, and 1 PE5). Most of these identifications could be attributed to differences between reference proteome databases, underscoring the case for using a single standard database for human protein detection from MS data. Our results suggest that search strategies with modified parameters can be rewarding alternatives for extensive profiling of missing proteins. We conclude that complementary searches of spectral data that incorporate different parameters, such as PTMs, against a comprehensive yet compact search database might lead to the discovery of proteins so far assigned to the missing human proteome.