A cost-sensitive online learning method for peptide identification.

Author information

College of Science, China University of Petroleum, Changjiang West Road, Qingdao, 266580, China.

School of Engineering and Applied Science, Western Kentucky University, Bowling Green, 42101, KY, USA.

Publication information

BMC Genomics. 2020 Apr 25;21(1):324. doi: 10.1186/s12864-020-6693-y.

DOI:10.1186/s12864-020-6693-y

PMID:32334531

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7183122/

Abstract

BACKGROUND

Post-database search is a key step in peptide identification by tandem mass spectrometry (MS/MS): it refines the peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, large-scale datasets and datasets with an unbalanced distribution of PSMs remain challenging, and a more efficient learning strategy is needed to improve accuracy on them. Complex learning models have greater classification power, but they may overfit and are computationally expensive on large-scale datasets. Kernel methods map data from the sample space to a high-dimensional space in which the relationships among the data are simpler to model.

RESULTS

To tackle the computational challenge of applying kernel-based learning models to practical peptide identification problems, we present an online learning algorithm, OLCS-Ranker, which feeds only one training sample into the learning model at each round and thereby significantly reduces the memory required for computation. We also propose a cost-sensitive learning model for OLCS-Ranker in which the loss function assigns a larger loss to decoy PSMs than to target PSMs.
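The two ideas in this paragraph, one-sample-at-a-time updates and a heavier penalty on decoys, can be illustrated with a minimal sketch. It is not the OLCS-Ranker algorithm itself: it is a generic kernelized online learner with a class-weighted hinge loss, and the class name, the RBF kernel choice, the cost values, the learning rate, and the toy two-feature PSMs are all assumptions made for illustration.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


class CostSensitiveOnlineKernelClassifier:
    """Generic online kernel learner with class-dependent hinge-loss costs.

    Hypothetical illustration only; not the OLCS-Ranker update rule.
    Labels: +1 = target PSM, -1 = decoy PSM.
    """

    def __init__(self, decoy_cost=2.0, target_cost=1.0, lr=0.5, gamma=0.5):
        # A violated margin on a decoy costs more than one on a target.
        self.costs = {-1: decoy_cost, +1: target_cost}
        self.lr = lr
        self.gamma = gamma
        self.support = []  # list of (coefficient, feature vector)

    def score(self, x):
        """Decision value f(x) = sum_i coef_i * K(x_i, x)."""
        return sum(c * rbf_kernel(s, x, self.gamma) for c, s in self.support)

    def partial_fit(self, x, y):
        """Consume a single training sample and update the model.

        Functional gradient step on cost[y] * max(0, 1 - y * f(x)):
        the sample is stored only if it violates the margin.
        """
        if y * self.score(x) < 1.0:
            self.support.append((self.lr * self.costs[y] * y, x))


# Toy stream of (features, label) pairs; the values are made up.
stream = [
    ((1.0, 1.2), +1), ((-1.0, -0.8), -1),
    ((0.8, 1.0), +1), ((-1.2, -1.0), -1),
    ((1.1, 0.9), +1), ((-0.9, -1.1), -1),
]

model = CostSensitiveOnlineKernelClassifier()
for features, label in stream:
    model.partial_fit(features, label)  # only one sample is handled per round

target_like = model.score((1.0, 1.0))
decoy_like = model.score((-1.0, -1.0))
```

Because only one sample is processed per round, the learner never needs the full kernel matrix in memory; it keeps just the samples that violated the margin. In this sketch each stored decoy gets a coefficient twice as large in magnitude as a stored target, which is one simple way to encode a larger loss for decoys.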

CONCLUSIONS

The new model reduces the false discovery rate on datasets with an unbalanced distribution of PSMs. Experimental studies show that OLCS-Ranker outperforms other methods in accuracy and stability, especially on such datasets. Furthermore, OLCS-Ranker is 15-85 times faster than CRanker.
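For readers unfamiliar with how a false discovery rate is typically estimated from target and decoy PSMs, the following is a minimal sketch. It assumes one common estimator, the number of accepted decoys divided by the number of accepted targets; the abstract does not state which variant the authors use, and the function names and scores below are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among PSMs scoring at or above the threshold.

    psms: list of (score, is_decoy) pairs.
    Assumed estimator: accepted decoys / accepted targets, capped at 1.
    """
    targets = sum(1 for s, is_decoy in psms if s >= threshold and not is_decoy)
    decoys = sum(1 for s, is_decoy in psms if s >= threshold and is_decoy)
    if targets == 0:
        return 1.0 if decoys else 0.0
    return min(1.0, decoys / targets)


def targets_at_fdr(psms, max_fdr):
    """Largest number of target PSMs accepted while the estimated FDR <= max_fdr."""
    best = 0
    for threshold in sorted({s for s, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= max_fdr:
            accepted = sum(1 for s, d in psms if s >= threshold and not d)
            best = max(best, accepted)
    return best


# Made-up scores: (score, is_decoy)
psms = [
    (9.1, False), (8.7, False), (8.2, False), (7.9, True), (7.5, False),
    (6.8, False), (6.1, True), (5.4, False), (4.9, True), (4.2, True),
]

fdr_at_7 = estimate_fdr(psms, 7.0)          # 1 decoy / 4 targets
accepted_25 = targets_at_fdr(psms, 0.25)    # targets accepted at FDR <= 25%
```

A better ranking model pushes decoys toward lower scores, so more target PSMs can be accepted before the estimated FDR exceeds the chosen limit.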

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2be7/7183122/886963d96824/12864_2020_6693_Fig1_HTML.jpg
