

Using Language Representation Learning Approach to Efficiently Identify Protein Complex Categories in Electron Transport Chain.

Affiliations

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, 32003.

Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan.

Publication information

Mol Inform. 2020 Oct;39(10):e2000033. doi: 10.1002/minf.202000033. Epub 2020 Jul 16.

Abstract

We propose a novel approach based on language representation learning to categorize electron complex proteins into five types. The idea stems from the characteristics shared by human language and protein sequence language, so advanced natural language processing techniques were used to extract useful features. Specifically, we employed transfer learning and word embedding techniques to analyze electron complex sequences and create efficient feature sets, then classified them with a support vector machine algorithm. During 5-fold cross-validation, seven types of sequence-based features were analyzed to find the optimal features. On average, our final classification models achieved an accuracy, specificity, sensitivity, and MCC of 96%, 96.1%, 95.3%, and 0.86, respectively, on the cross-validation data. On the independent test data, the corresponding scores were 95.3%, 92.6%, 94%, and 0.87. We conclude that, with features extracted by these representation learning methods, a simple machine learning algorithm performs on par with the existing deep neural network method on the task of categorizing electron complexes, while generating features much faster. Furthermore, the results show that combining features learned by the representation learning methods with sequence motif counts helps yield better performance.
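The four scores reported above can all be derived from a confusion matrix. The following is a minimal sketch of one common way to compute them for a multi-class problem: each class is scored one-vs-rest and the results are averaged. The abstract does not state how the authors averaged their scores, so the macro-averaging here is an assumption, and the function names, class labels, and toy predictions are hypothetical.

```python
from math import sqrt


def binary_metrics(y_true, y_pred, positive):
    """One-vs-rest accuracy, sensitivity, specificity and MCC for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # MCC is set to 0 when the denominator vanishes (a common convention).
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def macro_average(y_true, y_pred):
    """Average the one-vs-rest metrics over all classes present in y_true."""
    labels = sorted(set(y_true))
    per_class = [binary_metrics(y_true, y_pred, c) for c in labels]
    return {k: sum(m[k] for m in per_class) / len(per_class) for k in per_class[0]}


# Toy data (hypothetical labels, not from the paper).
y_true = ["I", "I", "II", "II", "III", "III"]
y_pred = ["I", "I", "II", "III", "III", "III"]
scores = macro_average(y_true, y_pred)
print(scores)
```

In a cross-validation setting, the same functions would be applied to each fold's predictions and the fold scores averaged again.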

