Using Language Representation Learning Approach to Efficiently Identify Protein Complex Categories in Electron Transport Chain.

Affiliations

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, 32003.

Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan.

Publication information

Mol Inform. 2020 Oct;39(10):e2000033. doi: 10.1002/minf.202000033. Epub 2020 Jul 16.

DOI:10.1002/minf.202000033

PMID:32598045

Abstract

We herein propose a novel approach based on language representation learning to categorize electron complex proteins into 5 types. The idea stems from the shared characteristics of human language and protein sequence language; advanced natural language processing techniques were therefore used to extract useful features. Specifically, we employed transfer learning and word embedding techniques to analyze electron complex sequences and create efficient feature sets, then used a support vector machine algorithm to classify them. During 5-fold cross-validation, seven types of sequence-based features were analyzed to find the optimal features. On average, our final classification models achieved an accuracy, specificity, sensitivity, and MCC of 96%, 96.1%, 95.3%, and 0.86, respectively, on cross-validation data. For the independent test data, the corresponding performance scores were 95.3%, 92.6%, 94%, and 0.87. We conclude that, with features extracted by these representation learning methods, the prediction performance of a simple machine learning algorithm is on par with the existing deep neural network method on the task of categorizing electron complexes, while feature generation is much faster. Furthermore, the results show that combining features learned by the representation learning methods with sequence motif counts helps yield better performance.
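The four scores reported above (accuracy, specificity, sensitivity, and MCC) can all be derived from confusion-matrix counts. The sketch below is illustrative only and is not the authors' code: it assumes that each of the 5 complex types is scored one-vs-rest and that the per-class scores are then averaged, which the abstract does not state. The function names and the toy labels in the usage note are hypothetical.

```python
import math


def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN when `positive` is treated as the positive class.

    Assumption: multi-class results are scored one class at a time
    (one-vs-rest); the abstract does not say how averages were formed.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth != positive and pred != positive:
            tn += 1
        else:  # truth is positive, prediction is not
            fn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity, and MCC from raw counts.

    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def macro_average(y_true, y_pred, classes):
    """Average each metric over the given classes (unweighted mean)."""
    per_class = [
        binary_metrics(*one_vs_rest_counts(y_true, y_pred, c)) for c in classes
    ]
    return {
        name: sum(m[name] for m in per_class) / len(per_class)
        for name in per_class[0]
    }
```

With made-up labels such as `y_true = ["I", "II", "I", "III"]` and `y_pred = ["I", "I", "I", "III"]`, calling `macro_average(y_true, y_pred, ["I", "II", "III"])` returns one averaged value per metric. MCC is worth reporting alongside accuracy because it stays informative when the classes are imbalanced.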

Similar articles

Use Chou's 5-Steps Rule With Different Word Embedding Types to Boost Performance of Electron Transport Protein Prediction Model.

IEEE/ACM Trans Comput Biol Bioinform. 2022 Mar-Apr;19(2):1235-1244. doi: 10.1109/TCBB.2020.3010975. Epub 2022 Apr 1.

Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters.

Anal Biochem. 2019 Jul 15;577:73-81. doi: 10.1016/j.ab.2019.04.011. Epub 2019 Apr 22.

TNFPred: identifying tumor necrosis factors using hybrid features based on word embeddings.

BMC Med Genomics. 2020 Oct 22;13(Suppl 10):155. doi: 10.1186/s12920-020-00779-w.

ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations.

Comput Biol Chem. 2021 Aug;93:107537. doi: 10.1016/j.compbiolchem.2021.107537. Epub 2021 Jun 29.

Modeling aspects of the language of life through transfer-learning protein sequences.

BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.

Feature selection and classification of protein-protein complexes based on their binding affinities using machine learning approaches.

Proteins. 2014 Sep;82(9):2088-96. doi: 10.1002/prot.24564. Epub 2014 Apr 16.

Identification of DNA-Binding Proteins Using Mixed Feature Representation Methods.

Molecules. 2017 Sep 22;22(10):1602. doi: 10.3390/molecules22101602.

MfeCNN: Mixture Feature Embedding Convolutional Neural Network for Data Mapping.

IEEE Trans Nanobioscience. 2018 Jul;17(3):165-171. doi: 10.1109/TNB.2018.2841053. Epub 2018 May 28.

Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics.

PLoS One. 2015 Nov 10;10(11):e0141287. doi: 10.1371/journal.pone.0141287. eCollection 2015.
