Abdul-Khalek Naim, Picciani Mario, Shouman Omar, Wimmer Reinhard, Overgaard Michael Toft, Wilhelm Mathias, Gregersen Echers Simon
Department of Chemistry and Bioscience, Aalborg University, 9220 Aalborg, Denmark.
Computational Mass Spectrometry, School of Life Sciences, Technical University of Munich, 85354 Freising, Germany.
J Proteome Res. 2025 Jun 6;24(6):2709-2726. doi: 10.1021/acs.jproteome.4c00973. Epub 2025 May 9.
Identifying detectable peptides, known as flyers, is key in mass spectrometry-based proteomics. Peptide detectability is strongly related to peptide sequences and their resulting physicochemical properties. Moreover, the high variability in MS data challenges the development of a generic model for detectability prediction, underlining the need for customizable tools. We present Pfly, a deep learning model developed to predict peptide detectability based solely on peptide sequence. Pfly is a versatile and reliable state-of-the-art tool, offering high performance, accessibility, and easy customizability for end-users. This adaptability allows researchers to tailor Pfly to specific experimental conditions, improving accuracy and expanding applicability across various research fields. Pfly is an encoder-decoder with an attention mechanism, classifying peptides as flyers or non-flyers, and providing both binary and categorical probabilities for four distinct classes defined in this study. The model was initially trained on a synthetic peptide library and subsequently fine-tuned with a biological dataset to mitigate bias toward synthesizability, improving predictive capacity and outperforming state-of-the-art predictors in benchmark comparisons across different human and cross-species datasets. The study further investigates the influence of protein abundance and rescoring, illustrating the negative impact on peptide identification due to misclassification. Pfly has been integrated into the DLOmix framework and is accessible on GitHub at https://github.com/wilhelm-lab/dlomix.
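The relationship between the categorical and binary outputs mentioned above can be illustrated with a minimal sketch. It is not Pfly's implementation: the four class names, the rule that the binary flyer probability equals one minus the non-flyer probability, and the 0.5 decision threshold are all assumptions made for illustration, since the abstract states only that there are four classes and that both kinds of probability are reported.

```python
import math

# Hypothetical class labels: the abstract only says "four distinct classes".
CLASSES = ("non-flyer", "weak flyer", "intermediate flyer", "strong flyer")


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(logits, threshold=0.5):
    """Return categorical probabilities, a binary flyer probability, and a label.

    Assumption: every class other than "non-flyer" counts as a flyer, so the
    binary probability is 1 - P(non-flyer). The threshold is also assumed.
    """
    categorical = dict(zip(CLASSES, softmax(logits)))
    p_flyer = 1.0 - categorical["non-flyer"]
    label = "flyer" if p_flyer >= threshold else "non-flyer"
    return categorical, p_flyer, label


# Made-up scores for a single peptide, standing in for a model's output layer.
categorical, p_flyer, label = summarize([0.2, 1.5, 0.3, -1.0])
print(categorical, round(p_flyer, 3), label)
```

Collapsing the classes this way keeps the binary call consistent with the categorical output, so a peptide cannot be labeled a non-flyer while most of its probability mass sits on the flyer classes.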