MS2AI: automated repurposing of public peptide LC-MS data for machine learning applications.

Author information

Rehfeldt Tobias Greisager, Krawczyk Konrad, Bøgebjerg Mathias, Schwämmle Veit, Röttger Richard

Affiliations

Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark.

Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark.

Publication information

Bioinformatics. 2022 Jan 12;38(3):875-877. doi: 10.1093/bioinformatics/btab701.

DOI:10.1093/bioinformatics/btab701

PMID:34636883

Abstract

MOTIVATION

Liquid chromatography-mass spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identifying and quantifying thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, no tool yet mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (i) the absence of balanced training data with large sample sizes; (ii) the unclear definition of sufficiently information-rich data representations, e.g. for peptide identification; and (iii) the lack of benchmarking of ML methods on specific LC-MS problems.

RESULTS

We created the MS2AI pipeline, which automates the gathering of vast quantities of MS data for large-scale ML applications. The software retrieves raw data either from in-house sources or from the proteomics identifications database, PRIDE. The raw data are then stored in a standardized format amenable to ML, encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI; to demonstrate this, we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides.
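
As a rough illustration of what a standardized, ML-ready record could look like, the following Python sketch pairs an MS2 peak list with its peptide identification and bins the peaks into a fixed-length vector. The `SpectrumRecord` class, its field names, the `bin_spectrum` and `to_example` helpers, and the binning parameters are all assumptions made for this example; they are not taken from MS2AI, whose actual storage format is not described in this abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpectrumRecord:
    """Hypothetical record pairing one MS2 spectrum with its identification."""
    peptide: str                      # identified peptide sequence
    oxidized: bool                    # label: does the peptide carry an oxidation?
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs


def bin_spectrum(peaks, mz_min=100.0, mz_max=2000.0, n_bins=19):
    """Sum intensities into equal-width m/z bins, then scale so the largest bin is 1."""
    width = (mz_max - mz_min) / n_bins
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # min() guards against floating-point rounding at the upper edge
            index = min(int((mz - mz_min) / width), n_bins - 1)
            vec[index] += intensity
    top = max(vec)
    return [v / top for v in vec] if top > 0 else vec


def to_example(record):
    """Turn a record into a (features, label) pair for a binary classifier."""
    return bin_spectrum(record.peaks), int(record.oxidized)


# Illustrative values only; not real measurements.
record = SpectrumRecord(
    peptide="PEPTIDEM",
    oxidized=True,
    peaks=[(150.0, 10.0), (160.0, 30.0), (1950.0, 20.0), (2500.0, 99.0)],
)
features, label = to_example(record)
```

A fixed-length vector of this kind is one simple input representation for a classifier; a convolutional network such as the one mentioned above could equally operate on a two-dimensional representation, which this sketch does not attempt.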

AVAILABILITY AND IMPLEMENTATION

An open-source implementation of the software can be found at https://gitlab.com/roettgerlab/ms2ai.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
