Identification of efflux proteins based on contextual representations with deep bidirectional transformer encoders.

Author information

Taju Semmy Wellem, Shah Syed Muazzam Ali, Ou Yu-Yen

Affiliations

Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan.

Publication information

Anal Biochem. 2021 Nov 15;633:114416. doi: 10.1016/j.ab.2021.114416. Epub 2021 Oct 14.

Abstract

Efflux proteins are transport proteins expressed in the plasma membrane that move unwanted toxic substances out of the cell through specific efflux pumps. Several computational approaches have been proposed to predict transport proteins and thereby to understand how ions move across cell membranes. However, few methods have been developed to identify efflux proteins. This paper presents an approach that combines contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) with a Support Vector Machine (SVM) classifier. BERT is a highly effective pre-trained language model that performs exceptionally well on several Natural Language Processing (NLP) tasks. The contextualized representations from BERT were therefore used to capture the different, context-dependent interpretations of identical amino acids in a sequence. A dataset of efflux proteins with annotations was first established. The feature vectors were then extracted by passing the protein sequences through the hidden layers of the pre-trained model. The proposed method was trained on the complete training datasets to identify efflux proteins and achieved accuracies of 94.15% and 87.13% in independent tests on the membrane and transport datasets, respectively. This study opens a research avenue for the use of contextualized word embeddings in Bioinformatics and Computational Biology.
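
The feature-extraction idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: `to_tokens`, `toy_encoder`, and `mean_pool` are hypothetical names, the toy encoder is only a stand-in for a pre-trained BERT model, and averaging over residues is an assumed pooling choice that the abstract does not specify. The sketch shows a protein sequence treated as a sentence of amino-acid tokens, identical residues receiving different vectors depending on their neighbors, and the per-residue vectors reduced to one fixed-length feature vector.

```python
from statistics import fmean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def to_tokens(sequence):
    """Split a protein sequence into one token per residue; unknown letters become 'X'."""
    return [aa if aa in AMINO_ACIDS else "X" for aa in sequence.upper()]


def toy_encoder(tokens):
    """Hypothetical stand-in for a pre-trained encoder (NOT BERT).

    Each residue's vector depends on its left neighbor, itself, and its
    right neighbor, so identical residues in different contexts get
    different vectors. A real model would return learned hidden states.
    """
    padded = ["-"] + tokens + ["-"]
    return [
        [float(ord(padded[i - 1])), float(ord(padded[i])), float(ord(padded[i + 1]))]
        for i in range(1, len(padded) - 1)
    ]


def mean_pool(hidden_states):
    """Average per-residue vectors into one fixed-length feature vector (assumed pooling)."""
    if not hidden_states:
        raise ValueError("need at least one token vector")
    return [fmean(column) for column in zip(*hidden_states)]


tokens = to_tokens("MAKA")
states = toy_encoder(tokens)
features = mean_pool(states)

# The two 'A' residues sit in different contexts, so their vectors differ.
print(states[1] != states[3])
# One fixed-length vector per protein, regardless of sequence length.
print(len(features))
```

In a pipeline like the one the abstract describes, one such fixed-length vector per protein would then be passed to a classifier such as an SVM.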
