MESM: integrating multi-source data for high-accuracy protein-protein interactions prediction through multimodal language models.

Authors

Wang Feng, Chu Jinming, Shen Liyan, Chang Shan

Affiliations

School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China.

Changzhou University Huaide College, Taizhou, 214500, China.

Publication information

BMC Biol. 2025 Aug 11;23(1):253. doi: 10.1186/s12915-025-02356-y.

DOI:10.1186/s12915-025-02356-y

PMID:40784875

Abstract

BACKGROUND

Protein-protein interactions (PPIs) play a critical role in essential biological processes such as signal transduction, enzyme activity regulation, cytoskeletal structure, immune responses, and gene regulation. However, current methods mainly focus on extracting features from protein sequences and using graph neural networks (GNNs) to acquire interaction information from the PPI network graph. This narrow focus limits a model's ability to learn richer and more effective interaction information, which in turn degrades prediction performance.

RESULTS

In this study, we propose MESM, a novel deep learning method for predicting PPIs. The datasets used for the PPI prediction task were constructed primarily from the STRING database and comprise two Homo sapiens PPI datasets, SHS27k and SHS148k, and two Saccharomyces cerevisiae PPI datasets, SYS30k and SYS60k. MESM consists of three key modules. First, MESM extracts multimodal representations from protein sequence information, protein structure information, and point cloud features using a Sequence Variational Autoencoder (SVAE), a Variational Graph Autoencoder (VGAE), and a PointNet Autoencoder (PAE), respectively. Then, a Fusion Autoencoder (FAE) integrates these multimodal features, generating rich and balanced protein representations. Next, MESM leverages GraphGPS to learn structural information from the PPI network graph and combines it with a Graph Attention Network (GAT) to further capture protein interaction information. Finally, MESM uses a Graph Convolutional Network (GCN) and SubgraphGCN to extract global and local features from the overall graph and from its subgraphs. Moreover, we build seven independent graphs from the overall PPI network graph, one per PPI type, so that the model learns the features of each interaction type separately, thereby enhancing its learning ability for different types of interactions.
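The idea of building one graph per PPI type can be illustrated with a minimal sketch. This is not the authors' implementation. The seven type names below are an assumption (the label set commonly used in STRING-derived PPI benchmarks), and `split_by_type` and the protein identifiers are hypothetical names introduced only for illustration.

```python
# Assumed label set: seven interaction types as commonly used in
# STRING-derived PPI benchmarks (not stated in the abstract itself).
PPI_TYPES = (
    "reaction", "binding", "ptmod", "activation",
    "inhibition", "catalysis", "expression",
)


def split_by_type(edges):
    """Split a multi-label PPI edge list into one edge list per type.

    edges: iterable of (protein_a, protein_b, types), where types is a
    collection of interaction-type labels attached to that protein pair.
    Returns a dict mapping every type in PPI_TYPES to a sorted list of
    undirected edges, each stored as a sorted (protein, protein) tuple.
    """
    graphs = {t: set() for t in PPI_TYPES}
    for a, b, types in edges:
        pair = tuple(sorted((a, b)))  # undirected: normalize endpoint order
        for t in types:
            if t not in graphs:
                raise ValueError(f"unknown interaction type: {t}")
            graphs[t].add(pair)
    return {t: sorted(pairs) for t, pairs in graphs.items()}


# Hypothetical toy network with three proteins.
toy_edges = [
    ("P1", "P2", {"binding", "activation"}),
    ("P2", "P3", {"binding"}),
    ("P3", "P1", {"expression"}),
]
type_graphs = split_by_type(toy_edges)
```

Each of the seven resulting edge lists could then be passed to its own graph encoder, so that a pair carrying several labels appears in several type-specific graphs.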

CONCLUSIONS

Compared with state-of-the-art methods, MESM achieved improvements of 8.77%, 4.98%, 7.48%, and 6.08% on SHS27k, SHS148k, SYS30k, and SYS60k, respectively. These experimental results demonstrate that MESM significantly improves PPI prediction performance.
