Rehfeldt Tobias Greisager, Krawczyk Konrad, Bøgebjerg Mathias, Schwämmle Veit, Röttger Richard
Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark.
Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark.
Bioinformatics. 2022 Jan 12;38(3):875-877. doi: 10.1093/bioinformatics/btab701.
Liquid chromatography-mass spectrometry (LC-MS) is the established standard for analyzing the proteome of biological samples through the identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, no tool yet mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (i) the absence of balanced training data with large sample sizes; (ii) the unclear definition of sufficiently information-rich data representations for tasks such as peptide identification; and (iii) the lack of benchmarking of ML methods on specific LC-MS problems.
We created the MS2AI pipeline, which automates the gathering of vast quantities of MS data for large-scale ML applications. The software retrieves raw data either from in-house sources or from the PRoteomics IDEntifications (PRIDE) database. The raw data are then stored in a standardized format amenable to ML, encompassing MS1/MS2 spectra and peptide identifications. The tool bridges the gap between MS and AI, and to this end we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides.
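To make the idea of an ML-ready representation concrete, the following is a minimal sketch of how one MS2 spectrum and its peptide identification might be turned into a fixed-length input vector and a binary oxidation label. It is not the MS2AI implementation: the `SpectrumRecord` class, the `bin_spectrum` and `is_oxidized` functions, the m/z range, the bin count and the modification string format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpectrumRecord:
    """Hypothetical record pairing one MS2 spectrum with its peptide identification."""
    peptide: str                # identified peptide sequence
    modifications: List[str]    # modification names, e.g. "Oxidation (M)" (assumed format)
    mz: List[float]             # peak m/z values
    intensity: List[float]      # peak intensities, aligned with mz


def bin_spectrum(record: SpectrumRecord,
                 mz_min: float = 100.0,
                 mz_max: float = 2000.0,
                 n_bins: int = 19) -> List[float]:
    """Sum peak intensities into equal-width m/z bins and scale the largest bin to 1.

    Peaks outside [mz_min, mz_max) are ignored. An empty spectrum yields all zeros.
    """
    width = (mz_max - mz_min) / n_bins
    bins = [0.0] * n_bins
    for mz, inten in zip(record.mz, record.intensity):
        if mz_min <= mz < mz_max:
            # Clamp the index to guard against floating-point rounding at the upper edge.
            index = min(int((mz - mz_min) / width), n_bins - 1)
            bins[index] += inten
    peak = max(bins)
    return [b / peak for b in bins] if peak > 0 else bins


def is_oxidized(record: SpectrumRecord) -> bool:
    """Binary label: does any listed modification mention oxidation?"""
    return any("oxidation" in m.lower() for m in record.modifications)


if __name__ == "__main__":
    rec = SpectrumRecord(
        peptide="PEPTIDEM",
        modifications=["Oxidation (M)"],
        mz=[150.0, 160.0, 1000.0, 2500.0],
        intensity=[10.0, 30.0, 20.0, 99.0],
    )
    features = bin_spectrum(rec)
    print(len(features), is_oxidized(rec))
```

A vector of this kind, paired with its label, is the sort of input a classifier could be trained on. Real spectra would call for a much finer binning, and a convolutional network would typically consume a richer representation than this one-dimensional summary.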
An open-source implementation of the software can be found at https://gitlab.com/roettgerlab/ms2ai.
Supplementary data are available at Bioinformatics online.