Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan.
Comput Biol Med. 2021 Apr;131:104259. doi: 10.1016/j.compbiomed.2021.104259. Epub 2021 Feb 7.
Recently, language representation models have attracted considerable attention in natural language processing (NLP) due to their remarkable results. Among them, BERT (Bidirectional Encoder Representations from Transformers) has proven to be a simple yet powerful language model that achieves new state-of-the-art performance. BERT adopts contextualized word embeddings to capture the semantics and the context in which words appear. We utilized pre-trained BERT models to extract features from protein sequences for discriminating three families of glucose transporters: the major facilitator superfamily of glucose transporters (GLUTs), the sodium-glucose linked transporters (SGLTs), and the sugars will eventually be exported transporters (SWEETs). We treated protein sequences as sentences and transformed them into fixed-length meaningful vectors in which a 768- or 1024-dimensional vector represents each amino acid. We observed that the BERT-Base and BERT-Large models improved performance by more than 4% in terms of average sensitivity and Matthews correlation coefficient (MCC), demonstrating the effectiveness of this approach. We also developed a bidirectional transformer-based protein model (TransportersBERT) for comparison with existing pre-trained BERT models.
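To make the feature-extraction idea concrete, the following is a minimal sketch of treating a protein sequence as a "sentence" and extracting one 768-dimensional vector per amino acid with a pre-trained BERT model, using the HuggingFace transformers library. The checkpoint name (bert-base-uncased), the space-separated per-residue tokenization, the example sequence, and the mean-pooling step are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import BertModel, BertTokenizer

# Illustrative checkpoint: BERT-Base gives 768-dim hidden states
# (BERT-Large would give 1024-dim). Not necessarily the paper's checkpoint.
MODEL_NAME = "bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertModel.from_pretrained(MODEL_NAME)
model.eval()

# Treat the protein sequence as a sentence: one space-separated
# token per amino acid residue. The sequence itself is hypothetical.
sequence = "MKTLLLTLVVVTIVCLDLGYT"
sentence = " ".join(sequence)

with torch.no_grad():
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model(**inputs)

# Last-layer hidden states: one vector per token; positions 0 and -1
# are the [CLS] and [SEP] special tokens, so drop them.
per_residue = outputs.last_hidden_state[0, 1:-1]
print(per_residue.shape)  # torch.Size([21, 768]) for this 21-residue sequence

# One simple way to obtain a fixed-length sequence representation
# (assumption: mean pooling over residues).
seq_vector = per_residue.mean(dim=0)  # shape: (768,)
```

The resulting per-residue or pooled vectors could then serve as input features to a downstream classifier for the GLUT/SGLT/SWEET discrimination task described above.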