McIlwain Sean, Page David, Huttlin Edward L, Sussman Michael R
Department of Computer Sciences, University of Wisconsin, Madison, WI, USA.
Bioinformatics. 2007 Jul 1;23(13):i328-36. doi: 10.1093/bioinformatics/btm198.
This article presents a method to identify the isotopic distributions within a mass spectrum using a probabilistic classifier supplemented with dynamic programming. Such a system is needed for a variety of purposes, including generating robust and meaningful features from mass spectra to be used in classification.
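The abstract does not spell out the dynamic program. As a minimal sketch only, assume the classifier assigns each candidate distribution a score (for example a log-odds value) and the goal is to pick a set of non-overlapping candidates with the highest total score. The `Candidate` tuple, the scores, and `best_non_overlapping` are hypothetical names for illustration, not the authors' implementation.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical candidate: (first_peak_index, last_peak_index, classifier_score).
# Peak indices are inclusive; the score is an assumed log-odds from a classifier.
Candidate = Tuple[int, int, float]


def best_non_overlapping(cands: List[Candidate]) -> List[Candidate]:
    """Pick non-overlapping candidates that maximize the summed score."""
    cands = sorted(cands, key=lambda c: c[1])  # order by last peak index
    ends = [c[1] for c in cands]
    n = len(cands)
    best = [0.0] * (n + 1)  # best[i]: optimum using only the first i candidates
    take = [False] * n      # whether candidate i is used in best[i + 1]
    prev = [0] * n          # number of candidates ending before candidate i starts

    for i, (start, _end, score) in enumerate(cands):
        prev[i] = bisect_left(ends, start)  # candidates with end < start
        with_i = best[prev[i]] + score
        if with_i > best[i]:
            best[i + 1] = with_i
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the chosen candidates.
    chosen: List[Candidate] = []
    i = n
    while i > 0:
        if take[i - 1]:
            chosen.append(cands[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


example = [(0, 2, 1.5), (2, 4, 2.0), (3, 5, 1.0), (5, 6, 0.7)]
selected = best_non_overlapping(example)
```

In this sketch a candidate with a negative score is never selected, so the procedure can only add distributions that the assumed classifier considers more likely than not.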
The primary result of this article is that the dynamic programming approach significantly improves the sensitivity, without harming the specificity, of a probabilistic classifier for identifying isotopic distributions. When annotating isotopic distributions in spectra where an expert has performed the initial 'peak-picking' (removal of noise peaks), the dynamic programming approach gives a true positive rate of 96% and a false positive rate of 0.0%, whereas the classifier alone has a true positive rate of only 47% at a false positive rate of 0.0%. When annotating isotopic distributions in machine peak-picked spectra, which may contain many noise peaks, the dynamic programming approach gives a true positive rate of only 22.0%, but it still keeps a low false positive rate of 1.0% and still outperforms the classifier alone. All of these rates require exact matches with the distributions in the annotated spectra: in our evaluation a distribution is considered 'entirely incorrect' if it is missing even one peak or contains even one extraneous peak. We compared our method with the THRASH and AID-MS systems using a looser requirement: correctly identifying the distribution that contains the mono-isotopic mass. Under this measure, our dynamic programming approach achieves a true positive rate of 82% and a false positive rate of 1%, which again outperforms the classifier alone. The dynamic programming approach is more conservative than THRASH and AID-MS, yielding fewer true peaks and fewer false peaks, but its F-score is significantly better than those of THRASH and AID-MS. All results were obtained with 10-fold cross-validation on 99 sections of mass spectra containing a total of 214 hand-annotated isotopic distributions.
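To make the two matching criteria concrete, the sketch below scores predicted distributions against annotated ones under the strict rule (every peak must match) and under one possible reading of the looser rule. It assumes each distribution is a set of peak indices and that the mono-isotopic peak is the lowest-index peak; both are illustrative assumptions, as are the function names. The false positive rate is left out because it needs a definition of negative candidates that the abstract does not give.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, Tuple

# Assumed representation: a distribution is the set of its peak indices.
Distribution = FrozenSet[int]


def match_counts(
    predicted: Iterable[Distribution],
    annotated: Iterable[Distribution],
    key: Callable[[Distribution], Hashable] = lambda d: d,
) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    With the default key a prediction counts only if every peak matches,
    so one missing or extra peak makes it entirely incorrect.
    """
    pred = {key(d) for d in predicted}
    gold = {key(d) for d in annotated}
    return len(pred & gold), len(pred - gold), len(gold - pred)


def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


annotated = [frozenset({1, 2, 3}), frozenset({5, 6})]
predicted = [frozenset({1, 2, 3}), frozenset({5, 6, 7}), frozenset({9, 10})]

# Strict criterion: {5, 6, 7} fails because of the extraneous peak 7.
strict = match_counts(predicted, annotated)

# Looser criterion (assumed): compare only the lowest peak of each distribution.
loose = match_counts(predicted, annotated, key=min)

strict_f = f_score(*strict)
loose_f = f_score(*loose)
```

On this toy data the strict rule credits one of the two annotated distributions, while the looser rule credits both, which shows why the reported rates differ so much between the two evaluations.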
Programs are available via http://www.cs.wisc.edu/~mcilwain/IDM.