ATP_mCNN: Predicting ATP binding sites through pretrained language models and multi-window neural networks.

Author information

Le Van-The, Malik Muhammad-Shahid, Lin Yi-Jing, Liu Yu-Chen, Chang Yan-Yun, Ou Yu-Yen

Affiliations

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan.

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan; Department of Computer Sciences, Karakoram International University, Gilgit-Baltistan, 15100, Pakistan.

Publication information

Comput Biol Med. 2025 Feb;185:109541. doi: 10.1016/j.compbiomed.2024.109541. Epub 2024 Dec 8.

DOI:10.1016/j.compbiomed.2024.109541

PMID:39653625

Abstract

Adenosine triphosphate (ATP) plays a vital role in providing energy and enabling key cellular processes through interactions with binding proteins. Experimental identification of ATP-binding residues remains challenging, and the growing volume of protein sequence data makes computational methods for identifying binding sites necessary. To address this, we developed a multi-window convolutional neural network (CNN) architecture that takes pre-trained protein language model embeddings as input features. Multiple parallel convolutional layers, each with a different window size, scan for local motifs at different scales. Max pooling extracts the salient features from each window, and these are concatenated into a final multi-scale representation for residue-level classification. On benchmark datasets, our model achieves an area under the ROC curve (AUC) of 0.95, improving significantly on prior sequence-based models and outperforming CNN baselines. This demonstrates the utility of pre-trained language models and multi-window CNNs for sequence-based prediction of ATP-binding residues. Our approach provides a promising new direction for elucidating binding mechanisms and interactions from primary structure.
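The multi-window idea described in the abstract can be illustrated with a minimal forward-pass sketch in plain Python. It is not the authors' implementation: the window sizes (3, 5, 7), the embedding dimension (8), the number of filters per branch (4), the random untrained weights, and all function names are assumptions chosen for illustration, and random vectors stand in for the pre-trained language model embeddings.

```python
import math
import random


def conv1d_relu_maxpool(seq, filters, bias):
    """Valid 1-D convolution over positions, ReLU, then global max pooling.

    seq:     list of L embedding vectors, each of length D
    filters: list of F filters, each a list of k weight vectors of length D
    returns: list of F pooled activations (one per filter)
    """
    pooled = []
    for f, b in zip(filters, bias):
        k = len(f)
        best = 0.0  # ReLU output is never below zero
        for start in range(len(seq) - k + 1):
            s = b
            for offset in range(k):
                s += sum(w * x for w, x in zip(f[offset], seq[start + offset]))
            best = max(best, s)
        pooled.append(best)
    return pooled


def init_model(window_sizes=(3, 5, 7), dim=8, n_filters=4, seed=0):
    """Build one convolutional branch per window size (hypothetical sizes)."""
    rng = random.Random(seed)
    branches = []
    for k in window_sizes:
        filters = [[[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(k)]
                   for _ in range(n_filters)]
        branches.append((filters, [0.0] * n_filters))
    out_w = [rng.gauss(0, 0.1) for _ in range(n_filters * len(window_sizes))]
    return {"branches": branches, "out_w": out_w, "out_b": 0.0}


def multi_scale_features(model, seq):
    """Concatenate the max-pooled features of every window-size branch."""
    feats = []
    for filters, bias in model["branches"]:
        feats.extend(conv1d_relu_maxpool(seq, filters, bias))
    return feats


def predict_binding_probability(model, seq):
    """Logistic output on the multi-scale features for the target residue."""
    feats = multi_scale_features(model, seq)
    logit = model["out_b"] + sum(w * f for w, f in zip(model["out_w"], feats))
    return 1.0 / (1.0 + math.exp(-logit))


# Stand-in for the embeddings of 15 residues centred on the target residue.
rng = random.Random(1)
context = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(15)]

model = init_model()
features = multi_scale_features(model, context)
prob = predict_binding_probability(model, context)
print(len(features), prob)
```

Each branch sees the same input but responds to motifs of a different length, so the concatenated vector holds one pooled activation per filter per window size. In a trained model the weights would be learned from labelled binding residues rather than drawn at random.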
