Cannon William R, Jarman Kristin H, Webb-Robertson Bobbie-Jo M, Baxter Douglas J, Oehmen Christopher S, Jarman Kenneth D, Heredia-Langner Alejandro, Auberry Kenneth J, Anderson Gordon A
Computational Biology and Bioinformatics Group, Computational and Information Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA 99352, USA.
J Proteome Res. 2005 Sep-Oct;4(5):1687-98. doi: 10.1021/pr050147v.
We evaluate statistical models used in two-hypothesis tests for identifying peptides from tandem mass spectrometry data. The null hypothesis H(0), that a peptide matches a spectrum by chance, requires information on the probability of by-chance matches between peptide fragments and peaks in the spectrum. Likewise, the alternative hypothesis H(A), that the spectrum is due to a particular peptide, requires the probabilities that the peptide's fragments would indeed be observed if that peptide were the causative agent. We compare models for these probabilities by determining the identification rates they produce on an independent data set. The initial models use different probabilities depending on fragment ion type, but uniform probabilities for each ion type across all of the labile bonds along the backbone. More sophisticated models for the probabilities under both H(A) and H(0) are then introduced that do not assume uniform probabilities for each ion type. In addition, the performance of these models using a standard likelihood model is compared to that of an information theory approach derived from the likelihood model. A simple but effective model for incorporating peak intensities is also described. Finally, a support-vector machine is used to discriminate between correct and incorrect identifications based on multiple characteristics of the scoring functions. The results show a significant reduction in the misidentification rate compared with a benchmark cross-correlation-based approach.
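As a rough illustration of the two-hypothesis framing, the following minimal sketch scores a candidate peptide with a log-likelihood ratio, assuming each predicted fragment is an independent Bernoulli observation whose probability depends only on ion type (as in the initial, uniform-per-ion-type models). The function name, the dictionary names, and all probability values are hypothetical and chosen for illustration only; they are not taken from the paper, and the paper's actual models may differ.

```python
import math

# Hypothetical per-ion-type probabilities (illustrative values only).
P_OBSERVED_IF_TRUE = {"b": 0.6, "y": 0.7}  # P(fragment peak observed | H(A))
P_MATCH_BY_CHANCE = {"b": 0.1, "y": 0.1}   # P(fragment matches a peak | H(0))


def log_likelihood_ratio(fragments):
    """Sum log[P(obs | H(A)) / P(obs | H(0))] over predicted fragments.

    `fragments` is an iterable of (ion_type, matched) pairs, where `matched`
    says whether a peak was found at the fragment's predicted m/z.
    """
    score = 0.0
    for ion_type, matched in fragments:
        p_a = P_OBSERVED_IF_TRUE[ion_type]
        p_0 = P_MATCH_BY_CHANCE[ion_type]
        if matched:
            # A matched fragment is evidence for H(A) when p_a > p_0.
            score += math.log(p_a / p_0)
        else:
            # A missing fragment is evidence against H(A) when p_a > p_0.
            score += math.log((1.0 - p_a) / (1.0 - p_0))
    return score


if __name__ == "__main__":
    candidate = [("b", True), ("y", True), ("b", False), ("y", True)]
    print(log_likelihood_ratio(candidate))
```

A positive score favors H(A) over H(0) under these assumed probabilities. Replacing the per-ion-type constants with values that vary by backbone position would correspond, loosely, to the non-uniform models the abstract mentions.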