Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA.
Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, 38124, Germany.
Sci Rep. 2019 Mar 5;9(1):3577. doi: 10.1038/s41598-019-38746-w.
In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows a sequence to be segmented in multiple ways. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or a domain-specific dataset and then applied to a set of unseen sequences. This representation can be used as the input to downstream machine learning tasks in protein bioinformatics. Here, we introduce the representation through two applications, protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it in three settings: (1) a comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based evaluation of the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while achieving an F1 score comparable to other methods in the discovery of experimentally verified motifs. High recall suggests that DiMotif can be used to create short lists of candidate motifs for further experimental investigation. In the classification-based evaluation, the extracted motifs reliably detected integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
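To make the segmentation idea concrete, the following is a minimal Python sketch of BPE-style merge learning applied to amino-acid sequences. It is an illustration under stated assumptions, not the authors' implementation: the function names (`learn_merges`, `apply_merge`, `segment`, `sample_segmentations`) are hypothetical, the toy sequences are invented, and the sampling step simply draws a random number of merge operations, which is one simple way to obtain multiple segmentations and may differ from the sampling framework used in the paper.

```python
import random
from collections import Counter


def learn_merges(sequences, num_merges):
    """Learn BPE-style merge operations from a list of protein sequences."""
    # Every sequence starts as a list of single residues.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties are broken lexicographically
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def sample_segmentations(sequence, merges, n_samples, seed=0):
    """Assumed sampling scheme: apply a random-length prefix of the merges."""
    rng = random.Random(seed)
    return [
        segment(sequence, merges[: rng.randint(0, len(merges))])
        for _ in range(n_samples)
    ]


# Toy example (invented sequences, not real training data).
merges = learn_merges(["RGDRGD", "RGDA"], num_merges=2)
print(merges)                   # [('R', 'G'), ('RG', 'D')]
print(segment("RGDK", merges))  # ['RGD', 'K']
```

Merges learned on a large corpus can be replayed on unseen sequences, as `segment` does above. Applying fewer merge steps yields a finer segmentation, so varying the number of steps produces several alternative segmentations of the same sequence.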