Gou Wenrui, Ge Wenhui, Tan Yang, Li Mingchen, Fan Guisheng, Yu Huiqun
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.
Shanghai Engineering Research Center of Smart Energy, Shanghai, 201103, China.
Interdiscip Sci. 2025 Jun 8. doi: 10.1007/s12539-025-00732-4.
Protein structure is fundamental to understanding protein function and interactions. With the continuous advancement of protein structure prediction methods, structure databases are expanding rapidly. Identifying the origin of a protein structure is crucial for assessing the reliability of experimental structure determination and computational prediction methods, as well as for guiding downstream biological research. Existing protein representation approaches often fail to capture subtle yet critical structural differences, which makes precise structural traceability difficult. To address this, we propose a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro), for the representation and origin evaluation of protein structures. CPE-Pro integrates a pre-trained protein Structural Sequence Language Model (SSLM) and a Geometric Vector Perceptron-Graph Neural Network (GVP-GNN) to learn structure-aware protein representations and capture structural differences, enabling accurate classification of structural data across four origins. Preliminary results indicate that, compared with large-scale protein language models trained on extensive amino acid sequences, structural sequences enriched with local structural features allow the model to capture more informative protein characteristics, thereby refining protein representations. Future research directions include extending the architecture to additional protein structure paradigms and developing evaluation methodologies for low-pLDDT predicted structures, providing more effective tools for protein structure analysis. The code, model weights, and all relevant materials are available at https://github.com/wr1102/CPE-Pro.
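The two-branch design described in the abstract (a structural-sequence embedding from the SSLM and a graph embedding from the GVP-GNN, combined to classify a structure into one of four origins) can be illustrated with a minimal sketch. This is not the authors' implementation: the fusion by concatenation, the single linear layer, the random untrained weights, and the names `FusionClassifier`, `seq_emb`, and `graph_emb` are all assumptions made for illustration. The abstract does not name the four origins, so classes are referred to only by index.

```python
import math
import random
from typing import List, Sequence

NUM_ORIGINS = 4  # the abstract states four origins; their names are not given there


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class FusionClassifier:
    """Hypothetical head: concatenate two embeddings, apply one linear layer.

    Assumption: fusion by concatenation. The abstract only says the two
    encoders are "integrated" and does not specify how.
    """

    def __init__(self, seq_dim: int, graph_dim: int,
                 num_classes: int = NUM_ORIGINS, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.in_dim = seq_dim + graph_dim
        # Small random weights stand in for trained parameters.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(self.in_dim)]
            for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def predict_proba(self, seq_emb: Sequence[float],
                      graph_emb: Sequence[float]) -> List[float]:
        """Return one probability per origin class."""
        fused = list(seq_emb) + list(graph_emb)
        if len(fused) != self.in_dim:
            raise ValueError("embedding sizes do not match the classifier")
        logits = [
            sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(logits)

    def predict(self, seq_emb: Sequence[float],
                graph_emb: Sequence[float]) -> int:
        """Return the index of the most probable origin class."""
        probs = self.predict_proba(seq_emb, graph_emb)
        return max(range(len(probs)), key=probs.__getitem__)


# Toy embeddings standing in for the SSLM and GVP-GNN outputs.
clf = FusionClassifier(seq_dim=3, graph_dim=2)
probs = clf.predict_proba([0.1, 0.2, 0.3], [0.4, 0.5])
label = clf.predict([0.1, 0.2, 0.3], [0.4, 0.5])
```

In practice the two embeddings would come from the pre-trained encoders and the head would be trained with a supervised loss; the released repository should be consulted for the actual architecture.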