A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases.

Author information

Sadygov Rovshan G, Yates John R

Affiliation

Department of Cell Biology, SR11, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, USA.

Publication information

Anal Chem. 2003 Aug 1;75(15):3792-8. doi: 10.1021/ac034157w.

DOI: 10.1021/ac034157w

PMID: 14572045

Abstract

We present a new probability-based method for protein identification using tandem mass spectra and protein databases. The method employs a hypergeometric distribution to model the frequencies of matches between an experimental tandem mass spectrum and the fragment ions predicted for peptide sequences in a protein sequence database that have a specific (M + H)+ value (within some mass tolerance). The hypergeometric distribution constitutes the null hypothesis: all peptide matches to a tandem mass spectrum are random. It is used to generate a score characterizing the randomness of a database sequence match to an experimental tandem mass spectrum and to determine the level of significance of the null hypothesis. For each tandem mass spectrum and database search, the peptide with the lowest probability of being a random match to the spectrum is identified, and the corresponding level of significance of the null hypothesis is determined. To check the validity of the hypergeometric model in describing fragment ion matches, we used a chi-square (χ²) test. The distribution of frequencies and the corresponding hypergeometric probabilities are generated for each tandem mass spectrum. No proteolytic cleavage specificity is used to create the peptide sequences from the database. We do not use any empirical probabilities in this method. The scores generated by the hypergeometric model do not have a significant molecular weight bias and are reasonably independent of database size. The approach has been implemented in a database search algorithm, PEP_PROBE. Using a large set of tandem mass spectra derived from peptides created by digesting a collection of known proteins with four different proteases, a false positive rate of 5% is demonstrated.
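
The matching model described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the formula, so the parameterization here is an assumption: N is the total number of fragment ions predicted for all candidate peptides in the precursor mass window, K is the number of those that match a peak in the experimental spectrum, n is the number of fragment ions predicted for one peptide, and k is the number of that peptide's fragments that match. The candidate names and counts are hypothetical, and the use of a tail probability P(X >= k) as the score is an illustrative choice, not necessarily what PEP_PROBE computes.

```python
from math import comb


def hypergeom_pmf(k, n, K, N):
    """P(X = k): exactly k of a peptide's n predicted fragments match,
    when K of the N fragments in the whole candidate pool match."""
    # Outside the support of the distribution the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of at least k matches under the null hypothesis
    that all fragment matches are random (assumed scoring choice)."""
    return sum(hypergeom_pmf(i, n, K, N) for i in range(k, min(n, K) + 1))


# Hypothetical candidates: (label, predicted fragments n, matched fragments k).
candidates = [
    ("PEPTIDE_A", 20, 4),
    ("PEPTIDE_B", 22, 15),
    ("PEPTIDE_C", 18, 5),
]

# Pool totals over all candidates in the mass window.
N = sum(n for _, n, _ in candidates)
K = sum(k for _, _, k in candidates)

# Score each candidate by how unlikely its match count is under the null.
scores = {name: hypergeom_tail(k, n, K, N) for name, n, k in candidates}

# The reported peptide is the one least likely to be a random match.
best = min(scores, key=scores.get)

for name, p in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: P(random match) = {p:.3e}")
print("best candidate:", best)
```

A real search would draw N and K from every database peptide within the mass tolerance rather than from three candidates. The same expected frequencies could then be compared with observed match counts in a chi-square goodness-of-fit test, as the abstract mentions.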
