Zhang Jiyang, Ma Jie, Dou Lei, Wu Songfeng, Qian Xiaohong, Xie Hongwei, Zhu Yunping, He Fuchu
State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, China.
Mol Cell Proteomics. 2009 Mar;8(3):547-57. doi: 10.1074/mcp.M700558-MCP200. Epub 2008 Nov 12.
Tandem mass spectrometry combined with database searching allows high-throughput identification of peptides in shotgun proteomics. However, although many solutions have been proposed for validating database search results, the sensitivity, specificity, and generalizability of validation algorithms can still be improved. Here we developed a Bayesian nonparametric (BNP) model for validating database search results. The model incorporates several popular techniques from statistical learning: compression of the feature space with a linear discriminant function, flexible nonparametric probability density estimation to handle the variable probability structure of complex problems, and Bayesian calculation of the posterior probability. Importantly, the BNP model is naturally compatible with the popular target-decoy database search strategy. We tested the BNP model on data sets from standard proteins and from real, complex samples acquired on multiple MS platforms and compared it with PeptideProphet, the cutoff-based method, and a simple nonparametric method that we proposed previously. The BNP model showed superior sensitivity and generalizability on all data sets searched. It also detected, and assigned high probabilities to, some high quality matches that had been filtered out by the other methods. Thus, the BNP model can validate database search results effectively and extract more information from MS/MS data.
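The three ingredients named in the abstract (a linear discriminant projection, nonparametric density estimation, and a Bayesian posterior) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each peptide-spectrum match is described by a numeric feature vector, that decoy matches stand in for incorrect matches, that the target and decoy databases are of equal size (so the fraction of incorrect target matches is approximated by the decoy-to-target count ratio), and that a Gaussian kernel with Silverman's rule-of-thumb bandwidth is an acceptable density estimator. The function names `fisher_direction`, `gaussian_kde`, and `posterior_correct` are hypothetical.

```python
import numpy as np


def fisher_direction(X_a, X_b):
    """Unit-length Fisher discriminant direction separating class a from class b.

    Assumes at least two features per match.
    """
    mean_a, mean_b = X_a.mean(axis=0), X_b.mean(axis=0)
    # Pooled within-class scatter, lightly regularised so it stays invertible.
    S_w = (np.cov(X_a, rowvar=False) * (len(X_a) - 1)
           + np.cov(X_b, rowvar=False) * (len(X_b) - 1))
    S_w = S_w + 1e-6 * np.eye(X_a.shape[1])
    w = np.linalg.solve(S_w, mean_a - mean_b)
    return w / np.linalg.norm(w)


def gaussian_kde(samples, points, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated at `points`."""
    z = (points[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def posterior_correct(X_target, X_decoy):
    """Posterior probability that each target match is correct (sketch only)."""
    # 1. Compress the feature space to a single discriminant score.
    w = fisher_direction(X_target, X_decoy)
    s_target, s_decoy = X_target @ w, X_decoy @ w

    # 2. Nonparametric densities: decoys model incorrect matches (assumption),
    #    targets model the mixture of correct and incorrect matches.
    pooled = np.concatenate([s_target, s_decoy])
    h = 1.06 * pooled.std() * len(pooled) ** (-0.2)  # Silverman's rule of thumb
    f_mixture = gaussian_kde(s_target, s_target, h)
    f_incorrect = gaussian_kde(s_decoy, s_target, h)

    # 3. Bayes' rule: P(correct | s) = 1 - pi0 * f_incorrect(s) / f_mixture(s),
    #    with pi0 approximated by the decoy/target count ratio (assumption).
    pi0 = min(1.0, len(s_decoy) / len(s_target))
    post = 1.0 - pi0 * f_incorrect / np.maximum(f_mixture, 1e-300)
    return np.clip(post, 0.0, 1.0)
```

Calling `posterior_correct(X_target, X_decoy)` with two arrays of shape `(n_matches, n_features)` returns one probability per target match. Because the incorrect-match density is taken directly from the decoy scores, no parametric form has to be assumed for either distribution, which is the sense in which such a model fits the target-decoy strategy.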