Alignment-free estimation of sequence conservation for identifying functional sites using protein sequence embeddings.

Affiliations

Institute of Bioinformatics, University of Georgia, 30602, Georgia, USA.

School of Computing, University of Georgia, 30602, Georgia, USA.

Publication information

Brief Bioinform. 2023 Jan 19;24(1). doi: 10.1093/bib/bbac599.

DOI:10.1093/bib/bbac599

PMID:36631405

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9851297/

Abstract

Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, its application to estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best ratio of performance to computational cost for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
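
The abstract does not state how per-residue embeddings are turned into conservation scores. One plausible reading, shown here only as an illustrative assumption and not as the authors' implementation, is a regression head that maps each residue's embedding vector to a score, so that no alignment is needed at prediction time. In the sketch below, random vectors stand in for language-model embeddings, the training targets are synthetic, and the names `fit_ridge` and `predict_conservation` are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the bias is the last weight (also regularized)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_conservation(embeddings, weights):
    """Map an (L, D) per-residue embedding matrix to L scores clipped to [0, 1]."""
    Xb = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])
    return np.clip(Xb @ weights, 0.0, 1.0)

rng = np.random.default_rng(0)
D = 16  # assumed toy embedding width; real models are far wider

# Synthetic stand-ins: per-residue "embeddings" and conservation targets in (0, 1).
true_w = rng.normal(size=D)
train_X = rng.normal(size=(500, D))
train_y = 1.0 / (1.0 + np.exp(-(train_X @ true_w)))
weights = fit_ridge(train_X, train_y)

# A hypothetical 30-residue query: one score per residue, with no alignment step.
query = rng.normal(size=(30, D))
scores = predict_conservation(query, weights)
```

In practice the random matrices would be replaced by embeddings from a protein language model, and the targets by conservation scores from a reference source. For the actual pipeline, consult the scripts at the repository linked in the abstract.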

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8f5/9851297/d1330835d4d2/bbac599f1.jpg

Similar articles

Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design.

Nucleic Acids Res. 2005 Oct 13;33(18):5861-7. doi: 10.1093/nar/gki894. Print 2005.

Modeling aspects of the language of life through transfer-learning protein sequences.

BMC Bioinformatics. 2019 Dec 17;20(1):723. doi: 10.1186/s12859-019-3220-8.

LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.

Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae290.

Sensitive remote homology search by local alignment of small positional embeddings from protein language models.

Elife. 2024 Mar 15;12:RP91415. doi: 10.7554/eLife.91415.

Variation in structural location and amino acid conservation of functional sites in protein domain families.

BMC Bioinformatics. 2005 Aug 25;6:210. doi: 10.1186/1471-2105-6-210.

AL2CO: calculation of positional conservation in a protein sequence alignment.

Bioinformatics. 2001 Aug;17(8):700-12. doi: 10.1093/bioinformatics/17.8.700.

Improving position-specific predictions of protein functional sites using phylogenetic motifs.

Bioinformatics. 2008 Oct 15;24(20):2308-16. doi: 10.1093/bioinformatics/btn454. Epub 2008 Aug 21.

Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences.

PLoS One. 2013 Sep 12;8(9):e75458. doi: 10.1371/journal.pone.0075458. eCollection 2013.

Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies.

Brief Bioinform. 2023 Jan 19;24(1). doi: 10.1093/bib/bbac619.

Cited by

Bridging artificial intelligence and biological sciences: a comprehensive review of large language models in bioinformatics.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf357.

Functional Screen of Wilson Disease ATP7B Variants Reveals Residual Transport Activities.

Hum Mutat. 2025 Jul 7;2025:7485658. doi: 10.1155/humu/7485658. eCollection 2025.

Transforming a Historical Chemical Synthetic Route for Vanillin Starting from Renewable Eugenol to a Cell-Free Bi-Enzymatic Cascade.

ChemSusChem. 2025 Jun 2;18(11):e202500387. doi: 10.1002/cssc.202500387. Epub 2025 Apr 16.

Accurate prediction of nucleic acid binding proteins using protein language model.

Bioinform Adv. 2025 Jan 20;5(1):vbaf008. doi: 10.1093/bioadv/vbaf008. eCollection 2025.

Detection of circular permutations by Protein Language Models.

Comput Struct Biotechnol J. 2024 Dec 30;27:214-220. doi: 10.1016/j.csbj.2024.12.029. eCollection 2025.

PairK: Pairwise k-mer alignment for quantifying protein motif conservation in disordered regions.

Protein Sci. 2025 Jan;34(1):e70004. doi: 10.1002/pro.70004.

Testing the Capability of Embedding-Based Alignments on the GST Superfamily Classification: The Role of Protein Length.

Molecules. 2024 Sep 29;29(19):4616. doi: 10.3390/molecules29194616.

PairK: Pairwise k-mer alignment for quantifying protein motif conservation in disordered regions.

bioRxiv. 2024 Jul 24:2024.07.23.604860. doi: 10.1101/2024.07.23.604860.

Informatic challenges and advances in illuminating the druggable proteome.

Drug Discov Today. 2024 Mar;29(3):103894. doi: 10.1016/j.drudis.2024.103894. Epub 2024 Jan 22.

MSTL-Kace: Prediction of Prokaryotic Lysine Acetylation Sites Based on Multistage Transfer Learning Strategy.

ACS Omega. 2023 Oct 25;8(44):41930-41942. doi: 10.1021/acsomega.3c07086. eCollection 2023 Nov 7.
