DeepPD: A Deep Learning Method for Predicting Peptide Detectability Based on Multi-feature Representation and Information Bottleneck.

Authors

Li Fenglin, Bin Yannan, Zhao Jianping, Zheng Chunhou

Affiliations

College of Mathematics and System Science, Xinjiang University, Urumqi, 830046, China.

Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Information Materials and Intelligent Sensing Laboratory of Anhui Province, and School of Artificial Intelligence, Anhui University, Hefei, 230601, China.

Publication information

Interdiscip Sci. 2025 Mar;17(1):200-214. doi: 10.1007/s12539-024-00665-4. Epub 2024 Dec 11.

DOI:10.1007/s12539-024-00665-4

PMID:39661307

Abstract

Peptide detectability measures the relationship between the protein composition and abundance in the sample and the peptides identified during the analytical procedure. This relationship has significant implications for the fundamental tasks of proteomics. Existing methods primarily rely on a single type of feature representation, which limits their ability to capture the intricate and diverse characteristics of peptides. In response to this limitation, we introduce DeepPD, an innovative deep learning framework incorporating multi-feature representation and the information bottleneck principle (IBP) to predict peptide detectability. DeepPD extracts semantic information from peptides using evolutionary scale modeling 2 (ESM-2) and integrates sequence and evolutionary information to construct the feature space collaboratively. The IBP effectively guides the feature learning process, minimizing redundancy in the feature space. Experimental results across various datasets demonstrate that DeepPD outperforms state-of-the-art methods. Furthermore, we demonstrate that DeepPD exhibits competitive generalization and transfer learning capabilities across diverse datasets and species. In conclusion, DeepPD emerges as the most effective method for predicting peptide detectability, showcasing its potential applicability to other protein sequence prediction tasks.
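The abstract does not state how the information bottleneck principle enters the training objective. As an illustration only, the sketch below shows one common formulation, the variational information bottleneck, in which a prediction loss is combined with a KL penalty that compresses the latent representation. The function names, the Gaussian latent with a standard normal prior, and the weight `beta` are assumptions made for this example, not details taken from DeepPD.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Per-sample KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    This term upper-bounds the information the latent code keeps about
    the input, so penalising it discourages redundant features.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Per-sample cross-entropy for a detectable / non-detectable label."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))


def vib_loss(mu, log_var, probs, labels, beta=1e-3):
    """Illustrative objective: prediction loss + beta * compression penalty.

    `beta` (hypothetical default) trades predictive accuracy against
    compression of the latent representation.
    """
    per_sample = binary_cross_entropy(probs, labels) + beta * kl_to_standard_normal(mu, log_var)
    return float(np.mean(per_sample))
```

With a small `beta` the objective is dominated by the classification loss; larger values push the latent code toward the prior and discard more of the input features.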
