Embeddings from protein language models predict conservation and variant effects.

Affiliations

Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany.

TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany.

Publication information

Hum Genet. 2022 Oct;141(10):1629-1647. doi: 10.1007/s00439-021-02411-y. Epub 2021 Dec 30.

DOI: 10.1007/s00439-021-02411-y
PMID: 34967936
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8716573/

Abstract

The emergence of SARS-CoV-2 variants stressed the demand for tools that interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation from single sequences almost as accurately as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, either consistently or statistically significantly, regardless of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
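
The abstract reports conservation accuracy as a two-state Matthews Correlation Coefficient (MCC). As a reading aid, the sketch below computes the standard MCC from two lists of binary labels. It is not taken from the VESPA code base. The function name `two_state_mcc`, the label encoding (1 = conserved, 0 = not conserved), and the toy data are illustrative assumptions.

```python
import math


def two_state_mcc(observed, predicted):
    """Matthews Correlation Coefficient for two equal-length lists of 0/1 labels.

    Assumed encoding (hypothetical, for illustration): 1 = conserved, 0 = not conserved.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")

    # Count the four cells of the 2x2 confusion matrix.
    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == 1 and pred == 1:
            tp += 1
        elif obs == 0 and pred == 0:
            tn += 1
        elif obs == 0 and pred == 1:
            fp += 1
        else:
            fn += 1

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: MCC is set to 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy example (made-up labels, not data from the paper).
observed = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(two_state_mcc(observed, predicted))
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct). On that scale, the gap between the reported 0.596 and 0.608 is small.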

Figures (from the PMC full text)

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd69/9522817/05c1ada6e960/439_2021_2411_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd69/9522817/bb8caa22e047/439_2021_2411_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd69/9522817/4376f3549c74/439_2021_2411_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd69/9522817/ccd2c24f15c3/439_2021_2411_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd69/9522817/f5c1e76930dc/439_2021_2411_Fig5_HTML.jpg
Fig. 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fd69/9522817/7b33bb4b0783/439_2021_2411_Fig6_HTML.jpg
