Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling.

Author information

Lennox Mark, Robertson Neil, Devereux Barry

Publication information

Annu Int Conf IEEE Eng Med Biol Soc. 2020 Jul;2020:2361-2367. doi: 10.1109/EMBC44109.2020.9176380.

Abstract

Deep learning has proven to be a useful tool for modelling protein properties. However, given the variability in the length of proteins, it can be difficult to summarise the sequence of amino acids effectively. In many cases, as a result of using fixed-length representations, information about long proteins can be lost through truncation, or model training can be slow due to the use of excessive padding. In this work, we aim to overcome these problems by expanding upon the original vocabulary used to represent the protein sequence. To this end, we utilise two prominent subword algorithms that have been previously used to reach state-of-the-art results in various Natural Language Processing tasks. The algorithms are used to encode the original protein sequence into a set of subsequences before they are analysed by a Doc2Vec model. The pre-trained encodings produced by each algorithm are tested on a variety of downstream tasks: four protein property prediction tasks (plasma membrane localization, thermostability, peak absorption wavelength, enantioselectivity) as well as drug-target affinity prediction tasks over two datasets. Our results significantly improve on the state-of-the-art for these tasks, demonstrating the benefits of using subword compression algorithms for modelling proteins.
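To make the encoding step concrete, here is a minimal sketch of how a subword algorithm can compress a protein sequence into longer tokens. The abstract does not name the two algorithms used, so byte-pair encoding (BPE) is assumed here purely as a well-known example. The function names (`learn_bpe`, `merge_pair`, `encode`) and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from amino-acid sequences."""
    # Start from the original vocabulary: one token per residue.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every sequence has collapsed to a single token
        # Most frequent adjacent pair; ties are broken deterministically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Apply the learned merges, in order, to a new sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy sequences for illustration only (not real data from the paper).
train = ["MKTAYIAKQR", "MKTLLVAKQR", "MKTAYLAKQG"]
merges = learn_bpe(train, num_merges=5)
tokens = encode(train[0], merges)
print(tokens)
```

Each merge adds one entry to the vocabulary and shortens the encoded sequences, which is how an expanded vocabulary reduces the need for truncation or padding. In the pipeline the abstract describes, the resulting token lists would then be treated as the "words" of a document and passed to a Doc2Vec model; that step is omitted here.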
