McIlwain Sean, Page David, Huttlin Edward L, Sussman Michael R
Department of Computer Sciences, University of Wisconsin, Madison, WI, USA.
Bioinformatics. 2008 Jul 1;24(13):i339-47. doi: 10.1093/bioinformatics/btn190.
In recent years, stable isotopic labeling has become a standard approach for quantitative proteomic analyses. Among the many available isotopic labeling strategies, metabolic labeling is attractive for the excellent internal control it provides. However, analysis of data from metabolic labeling experiments can be complicated because the spacing between the labeled and unlabeled forms of each peptide depends on its sequence, and thus varies from analyte to analyte. As a result, one generally needs to know the sequence of a peptide to identify its matching isotopic distributions in an automated fashion. In some experimental situations it is necessary or desirable to match pairs of labeled and unlabeled peaks from peptides of unknown sequence. This article addresses this largely overlooked problem in the analysis of quantitative mass spectrometry data by presenting an algorithm that not only identifies isotopic distributions within a mass spectrum, but also annotates matches between natural-abundance (light) isotopic distributions and their metabolically labeled (heavy) counterparts. The algorithm works in two stages: first we annotate the isotopic peaks using a modified version of the IDM algorithm described last year; then we use a probabilistic classifier, supplemented by dynamic programming, to find the metabolically labeled matched isotopic pairs. Such a method is needed for high-throughput quantitative proteomic and metabolomic experiments measured via mass spectrometry.
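To illustrate why the light/heavy spacing depends on sequence, the following minimal sketch computes the expected mass shift for a peptide under the assumption of complete 15N metabolic labeling. The abstract does not name the isotope used, so 15N is an assumption here, and the names `NITROGENS`, `count_nitrogens`, and `label_shift` are hypothetical and not part of the IDM programs.

```python
# Sketch only: assumes complete 15N labeling (an assumption; the abstract
# does not state which isotope was used).

# Nitrogen atoms per residue for the 20 standard amino acids.
NITROGENS = {
    "G": 1, "A": 1, "S": 1, "P": 1, "V": 1, "T": 1, "C": 1, "L": 1, "I": 1,
    "D": 1, "E": 1, "M": 1, "F": 1, "Y": 1,
    "N": 2, "Q": 2, "K": 2, "W": 2, "H": 3, "R": 4,
}

# Mass difference between 15N and 14N, in daltons.
N15_MINUS_N14 = 15.0001088982 - 14.0030740048


def count_nitrogens(sequence):
    """Total number of nitrogen atoms in an unmodified peptide."""
    return sum(NITROGENS[aa] for aa in sequence)


def label_shift(sequence, charge=1):
    """Expected m/z spacing between the light and fully labeled forms."""
    return count_nitrogens(sequence) * N15_MINUS_N14 / charge


if __name__ == "__main__":
    # Two peptides of equal length but different nitrogen content
    # give different light/heavy spacings.
    for peptide in ("GAK", "GAR"):
        print(peptide, count_nitrogens(peptide), round(label_shift(peptide), 4))
```

Under this assumption, two peptides of the same length can differ in spacing by several m/z units, which is why a fixed offset cannot be used to pair peaks when the sequence is unknown.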
The primary result of this article is that the dynamic programming approach performs well given perfect isotopic distribution annotations. Our algorithm achieves a true positive rate of 99% and a false positive rate of 1% using perfect isotopic distribution annotations. When the isotopic distributions are annotated from 'expert' selected peaks, the same algorithm achieves a true positive rate of 77% and a false positive rate of 1%. Finally, when annotating from 'machine' selected peaks, which may contain noise, the dynamic programming algorithm gives a true positive rate of 36% and a false positive rate of 1%. It is important to note that these rates arise from the requirement of exact annotations of both the light and heavy isotopic distributions. In our evaluations, a match is considered 'entirely incorrect' if it is missing even one peak or contains an extraneous peak. If we only require that the 'monoisotopic' peaks exist within the two matched distributions, our algorithm obtains a true positive rate of 45% and a false positive rate of 1% on the 'machine' selected data. Changes to the algorithm's scoring function and training example generation improve the 'monoisotopic' peak true positive rate to 65%, with a false positive rate of 2%. All results were obtained using 10-fold cross-validation over 41 mass spectra with a mass-to-charge range of 800-4000 m/z. In total, 713 isotopic distributions and 255 matched isotopic pairs were hand-annotated for this study.
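The difference between the strict and the 'monoisotopic' scoring criteria described above can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: each distribution is modeled as a set of peak indices ordered by m/z, the lowest-index peak of an annotated distribution is treated as its 'monoisotopic' peak (a simplification), and the function names are hypothetical. Only the true positive rate is computed, since the abstract does not define how false positives are counted.

```python
# Sketch only: a matched pair is (light_peaks, heavy_peaks), where each
# element is a frozenset of peak indices sorted by m/z (assumed encoding).


def exact_match(pred, true):
    """Strict criterion: both distributions must match peak for peak."""
    return pred == true


def mono_match(pred, true):
    """Relaxed criterion: the annotated 'monoisotopic' peaks (taken here as
    the lowest-index peak of each annotated distribution) must appear in
    the corresponding predicted distributions."""
    (pred_light, pred_heavy), (true_light, true_heavy) = pred, true
    return min(true_light) in pred_light and min(true_heavy) in pred_heavy


def true_positive_rate(predicted, annotated, criterion):
    """Fraction of annotated pairs recovered by at least one prediction."""
    hits = sum(any(criterion(p, t) for p in predicted) for t in annotated)
    return hits / len(annotated)


annotated = [
    (frozenset({0, 1, 2}), frozenset({5, 6, 7})),
    (frozenset({10, 11}), frozenset({14, 15})),
]
predicted = [
    (frozenset({0, 1, 2}), frozenset({5, 6, 7})),     # exact
    (frozenset({10, 11, 12}), frozenset({14, 15})),   # one extraneous peak
]

if __name__ == "__main__":
    print(true_positive_rate(predicted, annotated, exact_match))
    print(true_positive_rate(predicted, annotated, mono_match))
```

In this toy example the second prediction contains a single extraneous peak, so it counts as entirely incorrect under the strict criterion but as correct under the relaxed one, which mirrors the gap between the 36% and 45% figures reported for the 'machine' selected data.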
Programs are available via http://www.cs.wisc.edu/~mcilwain/IDM/.