Institute for Mathematics and Computer Science, University of Southern Denmark, 5000 Odense, Denmark.
VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium.
J Proteome Res. 2023 Feb 3;22(2):632-636. doi: 10.1021/acs.jproteome.2c00629. Epub 2023 Jan 24.
Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based liquid chromatography (LC) coupled to mass spectrometry (MS) data sets, due to the high levels of data reduction that occur between raw data and machine learning-ready data. Because predictive proteomics is an emerging field, labs that predict peptide behavior in LC-MS setups often use unique and complex data processing pipelines to maximize performance, at the cost of accessibility and reproducibility. For this reason, we introduce ProteomicsML, an online resource for proteomics-based data sets and tutorials across most of the currently explored physicochemical peptide properties. This community-driven resource makes it simple to access data in easy-to-process formats, and contains easy-to-follow tutorials that allow new users to interact with even the most advanced algorithms in the field. ProteomicsML provides data sets that are useful for comparing state-of-the-art machine learning algorithms, as well as introductory material for teachers and newcomers to the field. The platform is freely available at https://www.proteomicsml.org/, and we welcome the entire proteomics community to contribute to the project at https://github.com/ProteomicsML/ProteomicsML.