Choo Keng Wah, Tham Wai Mun
Bioinformatics Group, Nanyang Polytechnic, 569830 Singapore, Republic Of Singapore.
BMC Bioinformatics. 2007 Sep 20;8:352. doi: 10.1186/1471-2105-8-352.
Many algorithms have been developed for deciphering tandem mass spectrometry (MS) data sets. They fall essentially into two classes: the first searches a database of theoretical mass spectra, while the second is based on de novo sequencing from raw mass spectrometry data. The quality of the mass spectra significantly affects protein identification in both cases. This prompted the authors to explore ways to measure the quality of MS data sets before subjecting them to protein identification algorithms, allowing for more meaningful searches and increased confidence in the proteins identified.
The proposed method measures the quality of MS data sets based on the symmetric property of the b- and y-ion peaks present in an MS spectrum. Self-convolution of the MS data with its time-reversed copy was employed. Owing to the symmetric nature of the b-ion and y-ion peaks, the self-convolution result of a good spectrum has its highest intensity peak at the mid point. To reduce processing time, self-convolution was computed using the Fast Fourier Transform and its inverse, followed by removal of the "DC" (Direct Current) component and normalisation of the data set. The quality score was defined as the ratio of the intensity at the mid point to that of the remaining peaks of the convolution result. The method was validated using both theoretical mass spectra, with various permutations, and several real MS data sets. The results were encouraging, showing high positive prediction rates for spectra with good quality scores.
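The scoring steps described above can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' implementation, and it rests on several assumptions that the abstract does not state: the spectrum is already binned into an intensity array whose m/z range places complementary b- and y-ion peaks at mirror-image bins, the "DC" component is removed by subtracting the mean of the convolution result, normalisation divides by the largest absolute value, and "the remaining peaks" are summarised by their mean absolute intensity. The function name `quality_score` is hypothetical.

```python
import numpy as np


def quality_score(intensities):
    """Symmetry-based quality score for a binned spectrum (illustrative sketch)."""
    x = np.asarray(intensities, dtype=float)
    n = x.size
    if n < 2:
        raise ValueError("need at least two intensity bins")

    # Linear self-convolution via FFT. Zero-padding to 2n - 1 samples avoids
    # circular wrap-around. Convolving x with itself is equivalent to
    # cross-correlating x with its reversed copy.
    size = 2 * n - 1
    transformed = np.fft.rfft(x, size)
    conv = np.fft.irfft(transformed * transformed, size)

    # Assumption: remove the "DC" component by subtracting the mean.
    conv -= conv.mean()

    # Assumption: normalise by the largest absolute value.
    peak = np.abs(conv).max()
    if peak == 0.0:
        return 0.0
    conv /= peak

    # Index n - 1 collects every product x[i] * x[n - 1 - i], that is,
    # every pair of mirror-image bins.
    mid = n - 1
    rest = np.delete(conv, mid)

    # Assumption: compare the mid point with the mean absolute level elsewhere.
    return float(conv[mid] / np.abs(rest).mean())


if __name__ == "__main__":
    # Toy spectra with unit-height peaks; the bin positions are made up.
    symmetric = np.zeros(101)
    symmetric[[10, 25, 40, 60, 75, 90]] = 1.0   # mirror pairs sum to 100
    asymmetric = np.zeros(101)
    asymmetric[[5, 12, 33, 47, 58, 71]] = 1.0   # no mirror pairs

    print(quality_score(symmetric))
    print(quality_score(asymmetric))
```

With these toy inputs the mirror-symmetric spectrum scores well above the asymmetric one, because only the former contributes matched pairs to the mid-point bin. Other choices for the denominator, such as the largest remaining peak, would change the scale of the score but not this ordering.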
We have demonstrated in this work a method for determining the quality of tandem MS data sets. By determining the quality of tandem MS data before subjecting them to protein identification algorithms, spurious protein predictions due to poor tandem MS data are avoided, giving scientists greater confidence in the predicted results. We conclude that the algorithm performs well and could potentially be used as a pre-processing step for all mass spectrometry-based protein identification tools.