
GT-Finder: Classify the family of glucose transporters with pre-trained BERT language models.

Affiliations

Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan.

Publication information

Comput Biol Med. 2021 Apr;131:104259. doi: 10.1016/j.compbiomed.2021.104259. Epub 2021 Feb 7.

Abstract

Recently, language representation models have drawn a lot of attention in the field of natural language processing (NLP) due to their remarkable results. Among them, BERT (Bidirectional Encoder Representations from Transformers) has proven to be a simple, yet powerful language model that has achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embeddings to capture the semantics and context in which words appear. We utilized pre-trained BERT models to extract features from protein sequences for discriminating three families of glucose transporters: the major facilitator superfamily of glucose transporters (GLUTs), the sodium-glucose linked transporters (SGLTs), and the sugars will eventually be exported transporters (SWEETs). We treated protein sequences as sentences and transformed them into fixed-length meaningful vectors where a 768- or 1024-dimensional vector represents each amino acid. We observed that BERT-Base and BERT-Large models improved the performance by more than 4% in terms of average sensitivity and Matthews correlation coefficient (MCC), indicating the efficiency of this approach. We also developed a bidirectional transformer-based protein model (TransportersBERT) for comparison with existing pre-trained BERT models.
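The feature-extraction idea described above (treating a protein sequence as a sentence of amino-acid "words" and collapsing per-residue embeddings into a fixed-length vector) can be sketched as follows. This is an illustrative outline, not the authors' code: the helper names are hypothetical, and random numbers stand in for the real contextualized BERT embeddings (here sized to BERT-Base's 768-dimensional hidden states).

```python
import numpy as np

def sequence_to_sentence(seq):
    """Treat a protein sequence as a 'sentence': one space-separated
    token per amino acid, matching how BERT-style protein models
    expect their input."""
    return " ".join(seq)

def pool_embeddings(per_residue, method="mean"):
    """Collapse a (sequence_length, hidden_size) embedding matrix
    into one fixed-length vector for a downstream family classifier."""
    if method == "mean":
        return per_residue.mean(axis=0)
    raise ValueError(f"unknown pooling method: {method}")

# Hypothetical 12-residue peptide embedded with a BERT-Base-sized
# model (hidden size 768); random values replace actual model output.
seq = "MKTAYIAKQRQI"
print(sequence_to_sentence(seq))   # one token per amino acid
rng = np.random.default_rng(0)
per_residue = rng.standard_normal((len(seq), 768))
fixed_vec = pool_embeddings(per_residue)
print(fixed_vec.shape)             # (768,)
```

The fixed-length vector can then be fed to any standard classifier to discriminate the GLUT, SGLT, and SWEET families; the paper compares BERT-Base (768-d) and BERT-Large (1024-d) features in this role.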