Adrem Data Lab, Department of Computer Science, University of Antwerp, Middelheimlaan 1, 2020 Antwerpen, Belgium.
J Am Soc Mass Spectrom. 2024 Sep 4;35(9):2143-2155. doi: 10.1021/jasms.4c00180. Epub 2024 Jul 29.
In computational proteomics, machine learning (ML) has emerged as a vital tool for enhancing data analysis. Despite significant advancements, the diversity of ML model architectures and the complexity of proteomics data present substantial challenges in the effective development and evaluation of these tools. Here, we highlight the necessity for high-quality, comprehensive data sets to train ML models and advocate for the standardization of data to support robust model development. We emphasize the instrumental role of key data sets like ProteomeTools and MassIVE-KB in advancing ML applications in proteomics and discuss the implications of data set size on model performance, highlighting that larger data sets typically yield more accurate models. To address data scarcity, we explore algorithmic strategies such as self-supervised pretraining and multitask learning. Ultimately, we hope that this discussion can serve as a call to action for the proteomics community to collaborate on data standardization and collection efforts, which are crucial for the sustainable advancement and refinement of ML methodologies in the field.
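The multitask learning strategy mentioned above can be illustrated with a minimal sketch: one shared peptide encoder feeds two task-specific heads (for example, retention time prediction and fragment intensity prediction), so that a data-scarce task can benefit from representations learned on a data-rich one. Everything here — the toy amino-acid-count featurization, the layer sizes, and the two example tasks — is an illustrative assumption, not the architecture of any specific proteomics model.

```python
import numpy as np

rng = np.random.default_rng(0)

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def encode_peptide(seq):
    # Toy featurization: amino-acid composition counts (illustrative only;
    # real models typically use learned sequence embeddings).
    vec = np.zeros(len(ALPHABET))
    for aa in seq:
        idx = ALPHABET.find(aa)
        if idx >= 0:
            vec[idx] += 1
    return vec

# Shared encoder weights: in multitask training these are updated by
# gradients pooled from all tasks, amortizing data across them.
W_shared = rng.normal(size=(20, 16))

# Task-specific heads: a scalar retention-time output and an
# 8-bin fragment-intensity output (both sizes are arbitrary here).
W_rt = rng.normal(size=(16, 1))
W_intensity = rng.normal(size=(16, 8))

def forward(seq):
    h = np.tanh(encode_peptide(seq) @ W_shared)  # shared representation
    return h @ W_rt, h @ W_intensity

rt, intensities = forward("PEPTIDEK")
print(rt.shape, intensities.shape)
```

The design point is that the shared encoder is the expensive, data-hungry component; adding a new task costs only a small head, which is why multitask setups are attractive when labeled data for any single proteomics task is scarce.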