Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan.
Comput Biol Chem. 2021 Aug;93:107537. doi: 10.1016/j.compbiolchem.2021.107537. Epub 2021 Jun 29.
Primary and secondary active transport are the two types of active transport, both of which consume energy to move substances. Active transport mechanisms rely on proteins to assist in transport and play essential roles in regulating the traffic of ions and small molecules across the cell membrane against their concentration gradients. In this study, the two main types of proteins involved in such transport are classified from among transmembrane transport proteins. We propose a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT, a deep learning language representation model developed by Google, is a powerful model for transfer learning and one of the highest-performing pre-trained models for Natural Language Processing (NLP) tasks. Transfer learning with the pre-trained BERT model is applied to extract fixed feature vectors from its hidden layers and to learn contextual relations between amino acids in a protein sequence. The contextualized word representations of proteins are thus introduced to effectively model the complex arrangement of amino acids in a sequence and the variation of these amino acids with context. By generating context information, we capture multiple meanings for the same amino acid, revealing the importance of specific residues in the protein sequence.
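As an illustration only (this is not the authors' published code), the following minimal Python sketch shows the kind of pipeline the abstract describes: fixed-length feature vectors pooled from a pre-trained BERT model's hidden states, fed to an SVM. The Hugging Face transformers library, the Rostlab/prot_bert checkpoint, mean-pooling, and the toy sequences and labels are all assumptions for the sake of the example.

    # Sketch: BERT hidden-layer features for protein sequences + SVM classifier.
    # Checkpoint, pooling choice, and data are illustrative assumptions.
    import torch
    from transformers import BertModel, BertTokenizer
    from sklearn.svm import SVC

    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
    model = BertModel.from_pretrained("Rostlab/prot_bert")
    model.eval()

    def embed(sequence: str) -> torch.Tensor:
        """Fixed-length vector for one protein: mean-pool the last hidden
        layer over all token positions (one contextual vector per residue)."""
        # ProtBert expects single-letter amino acids separated by spaces.
        tokens = tokenizer(" ".join(sequence), return_tensors="pt")
        with torch.no_grad():
            hidden = model(**tokens).last_hidden_state  # (1, length, 1024)
        return hidden.mean(dim=1).squeeze(0)            # (1024,)

    # Toy training set: two sequences per hypothetical transporter class.
    X = torch.stack([embed(s) for s in ["MKTAYIAKQR", "MSDNGPQNQR",
                                        "MGLSDGEWQL", "MVLSPADKTN"]]).numpy()
    y = [0, 0, 1, 1]

    clf = SVC(kernel="rbf")  # the paper's classifier is an SVM
    clf.fit(X, y)
    print(clf.predict([embed("MKTAYIAKQL").numpy()]))

Because each residue's vector depends on its neighbors, the same amino acid letter yields different embeddings in different contexts, which is the property the abstract highlights.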
The performance of the proposed method is evaluated using five-fold cross-validation and an independent test set. The proposed method achieves accuracies of 85.44%, 88.74%, and 92.84% for Class-1, Class-2, and Class-3, respectively. Experimental results show that this approach outperforms other feature extraction methods by using context information, effectively classifies the two types of active transport, and improves overall performance.
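For concreteness, a minimal sketch of the five-fold cross-validation protocol with scikit-learn; the feature matrix and labels here are synthetic placeholders standing in for the BERT-derived vectors, not the study's data.

    # Sketch: stratified five-fold cross-validation of an SVM.
    # X and y are random placeholders for BERT features and class labels.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1024))   # placeholder BERT feature vectors
    y = rng.integers(0, 2, size=100)   # placeholder transporter-class labels

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv, scoring="accuracy")
    print(f"five-fold accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")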