Ding Jiarui, Shi Jinhong, Wu Fang-Xiang
Department of Mechanical Engineering, University of Saskatchewan, 57 Campus Dr., Saskatoon, SK S7N5A9, Canada.
Annu Int Conf IEEE Eng Med Biol Soc. 2009;2009:6747-50. doi: 10.1109/IEMBS.2009.5332499.
Several computational methods have been proposed to assess the quality of tandem mass spectra. These methods range from supervised to unsupervised algorithms and from discriminative to generative models. Existing unsupervised learning algorithms for tandem mass spectra are not based on probabilistic models, and they do not provide probabilities for spectral quality assessment. In this study, the distributions of high-quality and poor-quality spectra are modeled by a mixture of Gaussian distributions. The Expectation Maximization (EM) algorithm is used to estimate the parameters of the Gaussian mixture model. Each spectrum is assigned to the high-quality or poor-quality cluster according to its posterior probability. Experiments are conducted on two datasets, ISB and TOV. The results show that about 57.64% and 66.38% of poor-quality spectra can be removed without losing more than 10% of high-quality spectra for the two datasets, respectively. This indicates that clustering, as an exploratory data analysis tool, is valuable for the quality assessment of tandem mass spectra without using a pre-labeled training dataset.
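As a rough illustration of the approach summarized above, the following sketch fits a two-component Gaussian mixture with EM and assigns each item to a cluster by its posterior probability. It is not the authors' implementation. The one-dimensional `scores`, the synthetic data, the quartile-based initialization, the 0.5 posterior cutoff, and the convention that the component with the larger mean is the high-quality cluster are all assumptions made for illustration; the abstract does not state which spectral features are used or how the clusters are labeled.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(xs, n_iter=200, tol=1e-8):
    """EM for a 1-D two-component Gaussian mixture.

    Returns (weights, means, variances), each a list of length 2.
    """
    n = len(xs)
    xs_sorted = sorted(xs)
    # Assumed initialization: lower and upper quartiles as the two means.
    means = [xs_sorted[n // 4], xs_sorted[(3 * n) // 4]]
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        ll = 0.0
        for x in xs:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = max(joint[0] + joint[1], 1e-300)  # guard against underflow
            ll += math.log(total)
            resp.append([joint[0] / total, joint[1] / total])
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a collapsed component
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances


def posterior_high(x, weights, means, variances):
    """Posterior probability that x belongs to the larger-mean component.

    Treating the larger-mean component as "high quality" is an assumption.
    """
    hi = 0 if means[0] > means[1] else 1
    joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    return joint[hi] / max(joint[0] + joint[1], 1e-300)


# Synthetic stand-in for spectral quality scores (hypothetical data).
rng = random.Random(0)
poor = [rng.gauss(0.0, 1.0) for _ in range(600)]
high = [rng.gauss(4.0, 1.0) for _ in range(400)]
scores = poor + high

weights, means, variances = fit_two_gaussians(scores)

# Fraction of high-quality items kept and poor-quality items removed at a 0.5 cutoff.
kept_high = sum(posterior_high(x, weights, means, variances) >= 0.5 for x in high) / len(high)
removed_poor = sum(posterior_high(x, weights, means, variances) < 0.5 for x in poor) / len(poor)
print(f"kept high-quality: {kept_high:.3f}, removed poor-quality: {removed_poor:.3f}")
```

Because the model returns a posterior probability rather than a hard label, the cutoff can be moved away from 0.5 to trade retention of high-quality spectra against removal of poor-quality ones, which is the kind of trade-off the reported figures describe.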