Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada.
J Integr Bioinform. 2023 Jul 28;20(2). doi: 10.1515/jib-2022-0055. eCollection 2023 Jun 1.
Transmembrane transport proteins (transporters) play a crucial role in the fundamental cellular processes of all organisms by facilitating the transport of hydrophilic substrates across hydrophobic membranes. Despite the availability of numerous membrane protein sequences, their structures and functions remain largely elusive. Recently, natural language processing (NLP) techniques have shown promise in the analysis of protein sequences. Bidirectional Encoder Representations from Transformers (BERT) is an NLP technique that has been adapted for proteins to learn contextual embeddings of individual amino acids within a protein sequence. Our previous strategy, TooT-BERT-T, differentiated transporters from non-transporters by employing a logistic regression classifier with fine-tuned representations from ProtBERT-BFD. In this study, we expand upon this approach by utilizing representations from ProtBERT, ProtBERT-BFD, and MembraneBERT in combination with classical classifiers. Additionally, we introduce TooT-BERT-CNN-T, a novel method that fine-tunes ProtBERT-BFD and discriminates transporters using a Convolutional Neural Network (CNN). Our experimental results reveal that the CNN surpasses traditional classifiers in discriminating transporters from non-transporters, achieving a Matthews correlation coefficient (MCC) of 0.89 and an accuracy of 95.1% on the independent test set. Compared to TooT-BERT-T, this represents an improvement of 0.03 in MCC and of 1.11 percentage points in accuracy.
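For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MCC and accuracy are commonly computed for a binary transporter/non-transporter task. It is not taken from the study's code: the function names and the toy labels are assumptions made for illustration, and the numbers it produces are unrelated to the reported results.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = transporter, 0 = non-transporter)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (hypothetical, not data from the study).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
toy_accuracy = accuracy(*counts)
toy_mcc = mcc(*counts)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the two classes are not equally represented.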