Identification of efflux proteins based on contextual representations with deep bidirectional transformer encoders.

Author information

Taju Semmy Wellem, Shah Syed Muazzam Ali, Ou Yu-Yen

Affiliation

Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan.

Publication information

Anal Biochem. 2021 Nov 15;633:114416. doi: 10.1016/j.ab.2021.114416. Epub 2021 Oct 14.

DOI:10.1016/j.ab.2021.114416

PMID:34656612

Abstract

Efflux proteins are transport proteins expressed in the plasma membrane that move unwanted toxic substances out of the cell through specific efflux pumps. Several computational approaches have been proposed to predict transport proteins and thereby to understand how ions move across cell membranes. However, few methods have been developed to identify efflux proteins. This paper presents an approach that combines contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) with a Support Vector Machine (SVM) classifier. BERT is a highly effective pre-trained language model that performs exceptionally well on several Natural Language Processing (NLP) tasks. The contextualized representations from BERT were therefore used to capture the different interpretations that identical amino acids can take at different positions in a sequence. A dataset of efflux proteins with annotations was first established. Feature vectors were then extracted by passing the protein sequences through the hidden layers of the pre-trained model. The proposed method was trained on the complete training datasets to identify efflux proteins and achieved accuracies of 94.15% and 87.13% in independent tests on the membrane and transport datasets, respectively. This study opens a research avenue for applying contextualized word embeddings in Bioinformatics and Computational Biology.
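
The feature-extraction idea in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: `toy_contextual_states` is a hypothetical stand-in for the hidden layer of a pre-trained encoder, and mean pooling over residues is an assumption, since the abstract does not say how per-residue vectors are combined. The sketch only shows the shape of the pipeline: one token per residue, one context-dependent vector per token, and one fixed-length feature vector per protein that a classifier such as an SVM could consume.

```python
import math
from typing import List


def tokenize(sequence: str) -> List[str]:
    """Split a protein sequence into one token per residue, with BERT-style markers."""
    return ["[CLS]"] + list(sequence.upper()) + ["[SEP]"]


def toy_contextual_states(tokens: List[str], dim: int = 4) -> List[List[float]]:
    """Hypothetical stand-in for an encoder's last hidden layer (NOT a real BERT).

    Each vector depends on the token and its two immediate neighbours, so the
    same residue receives different vectors in different contexts. A real
    model would use self-attention over the whole sequence instead.
    """
    states = []
    for i, tok in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else ""
        right = tokens[i + 1] if i < len(tokens) - 1 else ""
        seed = (
            sum(ord(c) for c in tok)
            + 3 * sum(ord(c) for c in left)
            + 7 * sum(ord(c) for c in right)
        )
        states.append([math.sin(seed * (k + 1)) for k in range(dim)])
    return states


def mean_pool(states: List[List[float]]) -> List[float]:
    """Average per-token vectors into one fixed-length vector (assumed pooling)."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def embed(sequence: str, dim: int = 4) -> List[float]:
    """Turn a sequence of any length into a feature vector of length `dim`."""
    states = toy_contextual_states(tokenize(sequence), dim)
    return mean_pool(states[1:-1])  # drop the [CLS] and [SEP] positions


if __name__ == "__main__":
    states = toy_contextual_states(tokenize("AKA"))
    # The two alanines sit at token positions 1 and 3 but have different neighbours.
    print("same residue, same vector?", states[1] == states[3])
    print("feature vector length:", len(embed("MKTLLVAGA")))
```

In a real setting the stand-in would be replaced by the hidden states of an actual pre-trained model, and the pooled vectors, paired with efflux or non-efflux labels, would be used to train the classifier.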

Similar articles

ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations.

Comput Biol Chem. 2021 Aug;93:107537. doi: 10.1016/j.compbiolchem.2021.107537. Epub 2021 Jun 29.

TRP-BERT: Discrimination of transient receptor potential (TRP) channels using contextual representations from deep bidirectional transformer based on BERT.

Comput Biol Med. 2021 Oct;137:104821. doi: 10.1016/j.compbiomed.2021.104821. Epub 2021 Sep 1.

GT-Finder: Classify the family of glucose transporters with pre-trained BERT language models.

Comput Biol Med. 2021 Apr;131:104259. doi: 10.1016/j.compbiomed.2021.104259. Epub 2021 Feb 7.

A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information.

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab005.

Extracting comprehensive clinical information for breast cancer using deep learning methods.

Int J Med Inform. 2019 Dec;132:103985. doi: 10.1016/j.ijmedinf.2019.103985. Epub 2019 Oct 2.

BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection.

Comput Biol Chem. 2022 Aug;99:107732. doi: 10.1016/j.compbiolchem.2022.107732. Epub 2022 Jul 14.

BERT-Kcr: prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models.

Bioinformatics. 2022 Jan 12;38(3):648-654. doi: 10.1093/bioinformatics/btab712.

A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

The h-ANN Model: Comprehensive Colonoscopy Concept Compilation Using Combined Contextual Embeddings.

Biomed Eng Syst Technol Int Jt Conf BIOSTEC Revis Sel Pap. 2022 Feb;5:189-200. doi: 10.5220/0010903300003123.

Cited by

Bridging artificial intelligence and biological sciences: a comprehensive review of large language models in bioinformatics.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf357.

A BERT-based approach for identifying anti-inflammatory peptides using sequence information.

Heliyon. 2024 Jun 13;10(12):e32951. doi: 10.1016/j.heliyon.2024.e32951. eCollection 2024 Jun 30.
