Alves Gelio, Wu Wells W, Wang Guanghui, Shen Rong-Fong, Yu Yi-Kuo
National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.
J Proteome Res. 2008 Aug;7(8):3102-13. doi: 10.1021/pr700798h. Epub 2008 Jun 18.
Confident peptide identification is one of the most important components in mass-spectrometry-based proteomics. We propose a method to properly combine the results from different database search methods to enhance the accuracy of peptide identifications. The database search methods included in our analysis are SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X! Tandem (v2007.07.01.2), OMSSA (v2.0) and RAId_DbS. Using two data sets, one collected in profile mode and one collected in centroid mode, we tested the search performance of all 21 combinations of two search methods as well as all 35 possible combinations of three search methods. The results obtained from our study suggest that properly combining search methods does improve retrieval accuracy. In addition to performance results, we also describe the theoretical framework which in principle allows one to combine many independent scoring methods including de novo sequencing and spectral library searches. The correlations among different methods are also investigated in terms of common true positives, common false positives, and a global analysis. We find that the average correlation strength, between any pairwise combination of the seven methods studied, is usually smaller than the associated standard error. This indicates only weak correlation may be present among different methods and validates our approach in combining the search results. The usefulness of our approach is further confirmed by showing that the average cumulative number of false positive peptides agrees reasonably well with the combined E-value. The data related to this study are freely available upon request.
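As a rough illustration of the counting and the combination idea in the abstract, the sketch below enumerates the pairs and triples of the seven search methods and combines per-method p-values for a single candidate peptide. It uses Fisher's method for independent p-values as a stand-in; the abstract does not spell out the authors' combining formula, so this should not be read as their implementation. The function names and the example p-values are hypothetical, and any conversion from a method's reported E-value or score to a p-value is assumed to have been done beforehand.

```python
from itertools import combinations
from math import exp, factorial, log

# The seven database search methods named in the abstract.
ENGINES = ["SEQUEST", "ProbID", "InsPecT", "Mascot",
           "X! Tandem", "OMSSA", "RAId_DbS"]


def engine_combinations(k):
    """Return every unordered combination of k search methods."""
    return list(combinations(ENGINES, k))


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method (an assumption here,
    not necessarily the formula used in the paper).

    For k p-values with product tau and t = -ln(tau), the statistic 2t follows
    a chi-square distribution with 2k degrees of freedom under the null, whose
    upper tail has the closed form tau * sum_{j<k} t**j / j!.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    t = -sum(log(p) for p in p_values)
    k = len(p_values)
    return exp(-t) * sum(t ** j / factorial(j) for j in range(k))


if __name__ == "__main__":
    print(len(engine_combinations(2)), "pairs")
    print(len(engine_combinations(3)), "triples")
    # Hypothetical p-values reported by two methods for the same peptide.
    print(fisher_combined_p([0.05, 0.05]))
```

Fisher's method relies on the p-values being independent, which is why the weak correlation among methods reported in the abstract matters for any combination of this kind.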