Siepen Jennifer A, Keevil Emma-Jayne, Knight David, Hubbard Simon J
Faculty of Life Sciences, University of Manchester, M13 9PT, UK.
J Proteome Res. 2007 Jan;6(1):399-408. doi: 10.1021/pr060507u.
Protein identification via peptide mass fingerprinting (PMF) remains a key component of high-throughput proteomics experiments in post-genomic science. Candidate protein identifications are made using bioinformatic tools from peptide peak lists obtained via mass spectrometry (MS). These algorithms rely on several search parameters, including the number of potential uncut peptide bonds matching the primary specificity of the hydrolytic enzyme used in the experiment. Typically, up to one of these "missed cleavages" is considered by the bioinformatics search tools, usually after digestion of the in silico proteome by trypsin. Using two distinct, nonredundant datasets of peptides identified via PMF and tandem MS, a simple predictive method based on information theory is presented which is able to identify experimentally defined missed cleavages with up to 90% accuracy from amino acid sequence alone. Using this simple protocol, we are able to "mask" candidate protein databases so that confident missed cleavage sites need not be considered for in silico digestion. We show that this leads to an improvement in database searching, with two different search engines, using the PMF dataset as a test set. In addition, the improved approach is also demonstrated on an independent PMF dataset of known proteins that also has corresponding high-quality tandem MS data, validating the protein identifications. This approach has wider applicability for proteomics database searching, and the program for predicting missed cleavages and masking FASTA-formatted protein sequence databases has been made available via http://ispider.smith.man.ac.uk/MissedCleave.
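To make the role of the missed-cleavage parameter concrete, the following is a minimal Python sketch of in silico tryptic digestion with an optional site mask. It is not the authors' program. The function names (`cleavage_sites`, `digest`), the `masked` argument, and the cleavage rule used here (cut after K or R unless the next residue is P) are assumptions introduced for illustration, and a masked site is simply treated as never cut, which is one possible reading of the masking described above.

```python
from typing import List, Optional, Set


def cleavage_sites(seq: str, masked: Optional[Set[int]] = None) -> List[int]:
    """Return 0-based indices i where the bond after seq[i] is cut.

    Assumed rule (not taken from the abstract): trypsin cuts after K or R
    unless the next residue is P. Indices in `masked` (hypothetical input,
    e.g. predicted missed-cleavage sites) are never cut.
    """
    masked = masked or set()
    sites = []
    for i, residue in enumerate(seq[:-1]):
        if residue in "KR" and seq[i + 1] != "P" and i not in masked:
            sites.append(i)
    return sites


def digest(seq: str, max_missed: int = 1,
           masked: Optional[Set[int]] = None) -> List[str]:
    """Return peptides containing at most `max_missed` uncut sites."""
    if not seq:
        return []
    # Fragment boundaries: start of sequence, position after each cut, end.
    bounds = [0] + [i + 1 for i in cleavage_sites(seq, masked)] + [len(seq)]
    peptides = []
    for start in range(len(bounds) - 1):
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(bounds):
                break  # ran past the C-terminus
            peptides.append(seq[bounds[start]:bounds[end]])
    return peptides


if __name__ == "__main__":
    protein = "AGKDERPLKMW"  # toy sequence, not from the paper
    print(digest(protein, max_missed=0))
    print(digest(protein, max_missed=1))
    # Mask the K at index 2 as a predicted missed cleavage site.
    print(digest(protein, max_missed=0, masked={2}))
```

In this toy sequence, masking the first lysine means the peptide spanning that site is generated without having to allow a missed cleavage at every site, so the candidate list for a search would be smaller than with a blanket `max_missed=1` setting.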