Bern Marshall, Goldberg David, McDonald W Hayes, Yates John R
Palo Alto Research Center, Palo Alto, CA 94304, USA.
Bioinformatics. 2004 Aug 4;20 Suppl 1:i49-54. doi: 10.1093/bioinformatics/bth947.
A powerful proteomics methodology couples high-performance liquid chromatography (HPLC) with tandem mass spectrometry and database-search software, such as SEQUEST. Such a set-up, however, produces a large number of spectra, many of which are of too poor quality to be useful. Hence a filter that eliminates poor spectra before the database search can significantly improve throughput and robustness. Moreover, spectra judged to be of high quality, but that cannot be identified by database search, are prime candidates for still more computationally intensive methods, such as de novo sequencing or wider database searches including post-translational modifications.
We report on two different approaches to assessing spectral quality prior to identification: binary classification, which predicts whether or not SEQUEST will be able to make an identification, and statistical regression, which predicts a more universal quality metric involving the number of b- and y-ion peaks. The best of our binary classifiers can eliminate over 75% of the unidentifiable spectra while losing only 10% of the identifiable spectra. Statistical regression can pick out spectra of modified peptides that can be identified by a de novo program but not by SEQUEST. In a section of independent interest, we discuss intensity normalization of mass spectra.
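The trade-off quoted above (removing most unidentifiable spectra while losing about 10% of identifiable ones) can be illustrated with a minimal sketch of threshold selection on a quality score. This is not the authors' classifier: the function names `threshold_for_recall` and `filter_rates`, the per-spectrum `scores`, and the boolean `labels` (whether a database search identified the spectrum) are all hypothetical and introduced here only for illustration.

```python
import math
from typing import Sequence, Tuple


def threshold_for_recall(scores: Sequence[float],
                         labels: Sequence[bool],
                         target_recall: float = 0.90) -> float:
    """Return the highest score cutoff that still keeps at least
    `target_recall` of the identifiable (label=True) spectra.

    Assumption: a higher score means a better-quality spectrum.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("need at least one identifiable spectrum")
    # Number of identifiable spectra that must survive the filter;
    # the small epsilon guards against floating-point round-up.
    keep = max(1, math.ceil(target_recall * len(positives) - 1e-9))
    return positives[keep - 1]


def filter_rates(scores: Sequence[float],
                 labels: Sequence[bool],
                 threshold: float) -> Tuple[float, float]:
    """Return (fraction of identifiable spectra kept,
    fraction of unidentifiable spectra removed) when spectra
    scoring below `threshold` are discarded."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both identifiable and unidentifiable spectra")
    kept_pos = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    removed_neg = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    return kept_pos / n_pos, removed_neg / n_neg
```

Given scores from any quality model on a labeled set of spectra, one would call `threshold_for_recall` to fix the cutoff at the tolerated loss of identifiable spectra, then `filter_rates` to read off how many unidentifiable spectra that cutoff removes before the database search.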