Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark.
Department of Chemistry and Bioscience, Aalborg University, 9220 Aalborg, Denmark.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad096. Epub 2023 Nov 20.
Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data-processing pipeline from raw data analysis to end-user predictions and rescoring. ML models need large-scale datasets for training and repurposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs.
We aim to facilitate ML efforts on MS data by conducting a systematic analysis of the potential sources of variability in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer-learning evaluation to assess the benefits of current best-practice methods in the field.
Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct training datasets that closely resemble future test cases, as transferability to unseen datasets is severely limited. We also found that transfer learning did not increase model performance compared with a non-pretrained model.
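The transfer-learning comparison described above can be illustrated with a minimal, self-contained sketch. This is not the authors' pipeline: it uses a hypothetical synthetic setup with a simple linear model fit by gradient descent, where a model pretrained on an unrelated "source project" and then fine-tuned on the target is compared against a model trained on the target from scratch.

```python
import numpy as np

def train_linear(X, y, w0, lr=0.1, steps=200):
    """Fit y ~ X @ w by gradient descent on mean squared error, starting from w0."""
    w = w0.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
d = 5
w_source = rng.normal(size=d)   # ground-truth weights of the "source project"
w_target = rng.normal(size=d)   # unrelated ground-truth weights of the "target project"

X_src = rng.normal(size=(500, d))
y_src = X_src @ w_source
X_tgt = rng.normal(size=(500, d))
y_tgt = X_tgt @ w_target

# Transfer learning: pretrain on the source, then fine-tune on the target.
w_pretrained = train_linear(X_src, y_src, np.zeros(d))
w_finetuned = train_linear(X_tgt, y_tgt, w_pretrained)

# Baseline: train on the target from scratch (non-pretrained).
w_scratch = train_linear(X_tgt, y_tgt, np.zeros(d))

def mse(w):
    return float(np.mean((X_tgt @ w - y_tgt) ** 2))

print(f"fine-tuned MSE: {mse(w_finetuned):.6f}, from-scratch MSE: {mse(w_scratch):.6f}")
```

With enough target data and training steps, both models converge to comparable error on the target task, mirroring the finding that pretraining on dissimilar data confers little final-performance advantage over training from scratch.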