Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX).

Affiliations

Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA.

Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, 38124, Germany.

Publication information

Sci Rep. 2019 Mar 5;9(1):3577. doi: 10.1038/s41598-019-38746-w.

DOI:10.1038/s41598-019-38746-w
PMID:30837494
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6401088/
Abstract

In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning task in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while having an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrins, integrin-binding proteins, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend the k-mer based protein vector (ProtVec) embedding to a variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
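The segmentation idea in the abstract can be illustrated with a minimal BPE-style sketch in Python. It is not the authors' implementation. The function names (`learn_merges`, `merge_pair`, `segment`, `sample_segmentations`), the toy sequences, and the tie-breaking rule are all hypothetical. In particular, the sampling step shown here (truncating the learned merge list at a randomly drawn depth) is only one simple way to obtain several segmentations of the same sequence, and is an assumption rather than the sampling framework described in the paper.

```python
import random
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Learn an ordered list of merge operations, starting from single residues."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties broken lexicographically (an arbitrary choice
        # made here only to keep the sketch deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges):
    """Apply learned merges, in order, to a possibly unseen sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


def sample_segmentations(sequence, merges, n_samples, seed=0):
    """Assumed sampling scheme: draw a merge depth at random for each sample."""
    rng = random.Random(seed)
    depths = [rng.randint(1, len(merges)) for _ in range(n_samples)]
    return [segment(sequence, merges[:d]) for d in depths]


# Toy training set (hypothetical); a real run would use e.g. Swiss-Prot sequences.
toy_sequences = ["MKRGDS", "ARGDSK", "RGDMK"]
toy_merges = learn_merges(toy_sequences, num_merges=3)
print(toy_merges)                     # [('R', 'G'), ('RG', 'D'), ('RGD', 'S')]
print(segment("GRGDSP", toy_merges))  # ['G', 'RGDS', 'P']
```

Because the merges are stored as an ordered list, the same operations learned on one corpus can be replayed on sequences that were not in the training set, which mirrors the "learn on Swiss-Prot, apply to unseen sequences" workflow the abstract describes.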

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebe7/6401088/14112be34ce9/41598_2019_38746_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebe7/6401088/d968eceee292/41598_2019_38746_Figa_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebe7/6401088/5c332d49e026/41598_2019_38746_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebe7/6401088/d30402af037b/41598_2019_38746_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebe7/6401088/877e6ecd560a/41598_2019_38746_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebe7/6401088/e316726de7f4/41598_2019_38746_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebe7/6401088/c44ee060a8e3/41598_2019_38746_Fig5_HTML.jpg

Similar articles

1
Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX).
Sci Rep. 2019 Mar 5;9(1):3577. doi: 10.1038/s41598-019-38746-w.
2
Discriminative motif discovery in DNA and protein sequences using the DEME algorithm.
BMC Bioinformatics. 2007 Oct 15;8:385. doi: 10.1186/1471-2105-8-385.
3
A Monte Carlo-based framework enhances the discovery and interpretation of regulatory sequence motifs.
BMC Bioinformatics. 2012 Nov 27;13:317. doi: 10.1186/1471-2105-13-317.
4
16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.
PLoS Comput Biol. 2019 Feb 26;15(2):e1006721. doi: 10.1371/journal.pcbi.1006721. eCollection 2019 Feb.
5
Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics.
PLoS One. 2015 Nov 10;10(11):e0141287. doi: 10.1371/journal.pone.0141287. eCollection 2015.
6
Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks.
Brief Bioinform. 2022 Jul 18;23(4). doi: 10.1093/bib/bbac232.
7
Mining for class-specific motifs in protein sequence classification.
BMC Bioinformatics. 2013 Mar 15;14:96. doi: 10.1186/1471-2105-14-96.
8
SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins.
PLoS One. 2007 Oct 3;2(10):e967. doi: 10.1371/journal.pone.0000967.
9
Fast model-based protein homology detection without alignment.
Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.
10
Modeling aspects of the language of life through transfer-learning protein sequences.
BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.

Cited by

1
AOPxSVM: A Support Vector Machine for Identifying Antioxidant Peptides Using a Block Substitution Matrix and Amino Acid Composition, Transformation, and Distribution Embeddings.
Foods. 2025 Jun 6;14(12):2014. doi: 10.3390/foods14122014.
2
AI4Protein: transforming the future of protein design.
Sci China Life Sci. 2025 Jun 20. doi: 10.1007/s11427-024-2906-3.
3
Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.
Database (Oxford). 2025 May 30;2025. doi: 10.1093/database/baaf027.
4
Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa.
Commun Chem. 2025 Apr 11;8(1):114. doi: 10.1038/s42004-025-01484-4.
5
Integration of kinetic data into affinity-based models for improved T cell specificity prediction.
Biophys J. 2024 Dec 3;123(23):4115-4122. doi: 10.1016/j.bpj.2024.11.002. Epub 2024 Nov 8.
6
FaSTPACE: a fast and scalable tool for peptide alignment and consensus extraction.
NAR Genom Bioinform. 2024 Aug 21;6(3):lqae103. doi: 10.1093/nargab/lqae103. eCollection 2024 Sep.
7
PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications.
J Cheminform. 2024 Aug 2;16(1):92. doi: 10.1186/s13321-024-00884-3.
8
BERT2DAb: a pre-trained model for antibody representation based on amino acid sequences and 2D-structure.
MAbs. 2023 Jan-Dec;15(1):2285904. doi: 10.1080/19420862.2023.2285904. Epub 2023 Nov 27.
9
Quantitative approaches for decoding the specificity of the human T cell repertoire.
Front Immunol. 2023 Sep 7;14:1228873. doi: 10.3389/fimmu.2023.1228873. eCollection 2023.
10
Prediction of hot spots towards drug discovery by protein sequence embedding with 1D convolutional neural network.
PLoS One. 2023 Sep 18;18(9):e0290899. doi: 10.1371/journal.pone.0290899. eCollection 2023.

References

1
Gene2vec: distributed representation of genes based on co-expression.
BMC Genomics. 2019 Feb 4;20(Suppl 1):82. doi: 10.1186/s12864-018-5370-x.
2
DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection.
Bioinformatics. 2019 Jul 15;35(14):2498-2500. doi: 10.1093/bioinformatics/bty954.
3
Identifying antimicrobial peptides using word embedding with deep recurrent neural networks.
Bioinformatics. 2019 Jun 1;35(12):2009-2016. doi: 10.1093/bioinformatics/bty937.
4
MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.
Bioinformatics. 2018 Jul 1;34(13):i32-i42. doi: 10.1093/bioinformatics/bty296.
5
Mut2Vec: distributed representation of cancerous mutations.
BMC Med Genomics. 2018 Apr 20;11(Suppl 2):33. doi: 10.1186/s12920-018-0349-7.
6
SLALOM, a flexible method for the identification and statistical analysis of overlapping continuous sequence elements in sequence- and time-series data.
BMC Bioinformatics. 2018 Jan 26;19(1):24. doi: 10.1186/s12859-018-2020-x.
7
Protein classification using modified n-grams and skip-grams.
Bioinformatics. 2018 May 1;34(9):1481-1487. doi: 10.1093/bioinformatics/btx823.
8
The "Stressful" Life of Cell Adhesion Molecules: On the Mechanosensitivity of Integrin Adhesome.
J Biomech Eng. 2018 Feb 1;140(2). doi: 10.1115/1.4038812.
9
Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition.
J Chem Inf Model. 2018 Jan 22;58(1):27-35. doi: 10.1021/acs.jcim.7b00616. Epub 2018 Jan 10.
10
NLSdb-major update for database of nuclear localization signals and nuclear export signals.
Nucleic Acids Res. 2018 Jan 4;46(D1):D503-D508. doi: 10.1093/nar/gkx1021.