College of Science, China University of Petroleum, Changjiang West Road, Qingdao, 266580, China.
School of Engineering and Applied Science, Western Kentucky University, Bowling Green, 42101, KY, USA.
BMC Genomics. 2020 Apr 25;21(1):324. doi: 10.1186/s12864-020-6693-y.
Post-database search is a key step in peptide identification with tandem mass spectrometry (MS/MS): it refines the peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, challenges remain on large-scale datasets and on datasets with an unbalanced distribution of PSMs. A more efficient learning strategy is required to improve the accuracy of peptide identification on such challenging datasets. While complex learning models have greater classification power, they may overfit and add computational cost on large-scale datasets. Kernel methods map data from the sample space to high-dimensional spaces where data relationships can be simplified for modeling.
To tackle the computational challenge of using kernel-based learning models for practical peptide identification, we present an online learning algorithm, OLCS-Ranker, which feeds only one training sample into the learning model at each round; as a result, the memory required for computation is significantly reduced. We also propose a cost-sensitive learning model for OLCS-Ranker that assigns a larger loss to decoy PSMs than to target PSMs in the loss function.
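The two ideas in this paragraph, one-sample-at-a-time updates and a larger loss for decoys, can be illustrated with a minimal sketch. This is not the OLCS-Ranker algorithm itself, whose loss function and update rule are not given here. It is a generic online kernel learner using stochastic gradient steps on a cost-weighted hinge loss. The class name, the RBF kernel, and the values of `decoy_cost`, `target_cost`, `learning_rate`, and `gamma` are all assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


class OnlineCostSensitiveKernelClassifier:
    """Illustrative online kernel learner (hypothetical, not OLCS-Ranker).

    Labels: +1 for a target PSM, -1 for a decoy PSM.
    """

    def __init__(self, decoy_cost=2.0, target_cost=1.0,
                 learning_rate=0.5, gamma=0.5):
        self.decoy_cost = decoy_cost
        self.target_cost = target_cost
        self.learning_rate = learning_rate
        self.gamma = gamma
        # Each entry is (coefficient, feature_vector).
        self.support = []

    def score(self, x):
        """Decision value f(x) = sum_i coefficient_i * k(x_i, x)."""
        return sum(c * rbf_kernel(s, x, self.gamma) for c, s in self.support)

    def partial_fit(self, x, label):
        """Process a single training sample (one online round)."""
        margin = label * self.score(x)
        if margin < 1.0:
            # Hinge loss is active: take a gradient step whose size is
            # scaled by the class cost, so decoy errors weigh more.
            cost = self.decoy_cost if label < 0 else self.target_cost
            self.support.append((self.learning_rate * cost * label, x))


# Toy two-feature PSMs: targets cluster near (2, 2), decoys near (-2, -2).
samples = [((2.0, 2.0), 1), ((-2.0, -2.0), -1),
           ((1.5, 2.5), 1), ((-1.5, -2.5), -1)]

model = OnlineCostSensitiveKernelClassifier()
for features, label in samples:
    model.partial_fit(features, label)  # only one sample is seen per round
```

Because each round touches a single sample, the full kernel matrix is never built. In this sketch the only stored state is the list of samples that triggered an update.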
The new model can reduce the false discovery rate on datasets with an unbalanced distribution of PSMs. Experimental studies show that OLCS-Ranker outperforms other methods in accuracy and stability, especially on such datasets. Furthermore, OLCS-Ranker is 15-85 times faster than CRanker.
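For readers unfamiliar with how a false discovery rate is typically estimated in a target-decoy setting, the sketch below shows one common estimator: the number of accepted decoy PSMs divided by the number of accepted target PSMs. The abstract does not state which estimator the authors use, so the formula, the function name `estimate_fdr`, and the toy scores are assumptions for illustration only.

```python
def estimate_fdr(scored_psms, threshold):
    """Estimate FDR among PSMs scoring at or above `threshold`.

    `scored_psms` is an iterable of (score, is_decoy) pairs.
    Uses the simple decoys / targets ratio (an assumed estimator).
    """
    accepted = [(s, d) for s, d in scored_psms if s >= threshold]
    decoys = sum(1 for _, is_decoy in accepted if is_decoy)
    targets = len(accepted) - decoys
    if targets == 0:
        # No accepted targets: report 0 if nothing was accepted at all,
        # otherwise treat every accepted match as false.
        return 0.0 if decoys == 0 else 1.0
    return min(1.0, decoys / targets)


# Hypothetical scored PSMs: (score, is_decoy)
psms = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.2, True)]
fdr_at_half = estimate_fdr(psms, 0.5)  # 1 decoy and 3 targets are accepted
```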