Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Dr, Saskatoon, S7N 5A9, Canada.
Proteome Sci. 2012 Jun 21;10 Suppl 1(Suppl 1):S12. doi: 10.1186/1477-5956-10-S1-S12.
In a single proteomic project, tandem mass spectrometers can produce hundreds of millions of tandem mass spectra. However, the majority of these spectra are of poor quality, and searching them for peptides wastes time. Quality assessment before database search is therefore very useful in the pipeline of protein identification via tandem mass spectra, especially for reducing search time and false identifications. Most existing quality assessment methods are supervised machine learning methods based on a number of features that describe the quality of tandem mass spectra. These methods require training datasets in which the quality of every spectrum is known, and such datasets are usually unavailable for new data.
This study proposes an unsupervised machine learning method for quality assessment of tandem mass spectra that requires no training dataset. The method estimates the conditional probability that a spectrum is of high quality from quality assessments based on individual features. The probabilities are estimated by solving a constrained optimization problem. An efficient algorithm is developed to solve this problem and is proved to converge. Experimental results on two datasets show that searching only the tandem spectra that the proposed method determines to be of high quality saves about 56% and 62% of database searching time on the two datasets, while losing only a small number of high-quality spectra.
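The filtering step described above can be illustrated with a minimal sketch. It assumes the per-spectrum probabilities of being high quality have already been estimated; it does not reproduce the constrained optimization itself, which the abstract does not specify. The function names and the 0.5 threshold are hypothetical, and the fraction of spectra skipped is only a rough proxy for database searching time saved.

```python
from typing import List, Sequence


def select_for_search(probabilities: Sequence[float],
                      threshold: float = 0.5) -> List[int]:
    """Return indices of spectra whose estimated probability of being
    high quality is at least `threshold` (hypothetical cutoff)."""
    return [i for i, p in enumerate(probabilities) if p >= threshold]


def fraction_skipped(probabilities: Sequence[float],
                     threshold: float = 0.5) -> float:
    """Fraction of spectra excluded from the database search.

    This is only a proxy for search time saved; it assumes every
    spectrum costs roughly the same to search.
    """
    if len(probabilities) == 0:
        raise ValueError("probabilities must not be empty")
    kept = select_for_search(probabilities, threshold)
    return 1.0 - len(kept) / len(probabilities)


if __name__ == "__main__":
    # Made-up probabilities for five spectra, for illustration only.
    example = [0.9, 0.1, 0.2, 0.8, 0.05]
    print(select_for_search(example))
    print(fraction_skipped(example))
```

Raising the threshold skips more spectra but risks discarding more high-quality ones, which is the trade-off the reported results describe.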
The results indicate that the proposed method performs well for quality assessment of tandem mass spectra and that the way we estimate the conditional probabilities is effective.