Learning protein fitness models from evolutionary and assay-labeled data.

Affiliations

Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA.

Center for Computational Biology, University of California, Berkeley, USA.

Publication information

Nat Biotechnol. 2022 Jul;40(7):1114-1122. doi: 10.1038/s41587-021-01146-5. Epub 2022 Jan 17.

DOI:10.1038/s41587-021-01146-5

PMID:35039677

Abstract

Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
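The combination described in the abstract can be illustrated with a minimal sketch: one-hot, site-specific amino acid features are concatenated with a single density score per sequence, and a ridge regression is fit on the result. This is not the authors' code. The function names, the toy sequences, the density scores, and the fitness labels are all hypothetical, and the single penalty `alpha` applied uniformly to every feature is an assumption of this sketch (the abstract does not specify the regularization scheme). In practice the density scores would come from some evolutionary density model; here they are placeholder numbers.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Site-specific one-hot encoding: a block of 20 indicators per position."""
    x = np.zeros(len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        x[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return x


def build_features(seqs, density_scores):
    """Append one density feature per sequence to the one-hot site features."""
    onehot = np.stack([one_hot(s) for s in seqs])
    density = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([onehot, density])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Assumption: the same penalty alpha is applied to every feature.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    return w, y_mean - x_mean @ w


def predict(X, w, b):
    """Predicted fitness for each row of the feature matrix."""
    return X @ w + b


if __name__ == "__main__":
    # Hypothetical toy data: aligned variants of equal length, placeholder
    # density scores (e.g. log-likelihoods), and made-up assay labels.
    seqs = ["ACDE", "ACDF", "GCDE", "ACKE"]
    density_scores = [-3.1, -4.7, -5.2, -3.9]
    fitness = np.array([1.0, 0.4, 0.2, 0.8])

    X = build_features(seqs, density_scores)
    w, b = fit_ridge(X, fitness, alpha=1.0)
    print(predict(build_features(["GCDF"], [-6.0]), w, b))
```

Because the density score enters as just one more column, any evolutionary density model could supply it without changing the regression code, which matches the abstract's remark that any such model can be used.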

Similar articles

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction.

Int J Mol Sci. 2023 Nov 18;24(22):16496. doi: 10.3390/ijms242216496.

Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort.

J Chem Inf Model. 2024 Aug 26;64(16):6350-6360. doi: 10.1021/acs.jcim.4c00704. Epub 2024 Aug 1.

Combining handcrafted features with latent variables in machine learning for prediction of radiation-induced lung damage.

Med Phys. 2019 May;46(5):2497-2511. doi: 10.1002/mp.13497. Epub 2019 Apr 8.

Computational design of novel Cas9 PAM-interacting domains using evolution-based modelling and structural quality assessment.

PLoS Comput Biol. 2023 Nov 17;19(11):e1011621. doi: 10.1371/journal.pcbi.1011621. eCollection 2023 Nov.

Discriminative Mixture Variational Autoencoder for Semisupervised Classification.

IEEE Trans Cybern. 2022 May;52(5):3032-3046. doi: 10.1109/TCYB.2020.3023019. Epub 2022 May 19.

EvoRator: Prediction of Residue-level Evolutionary Rates from Protein Structures Using Machine Learning.

J Mol Biol. 2022 Jun 15;434(11):167538. doi: 10.1016/j.jmb.2022.167538. Epub 2022 Mar 11.

Prediction of mutation effects using a deep temporal convolutional network.

Bioinformatics. 2020 Apr 1;36(7):2047-2052. doi: 10.1093/bioinformatics/btz873.

Anomaly Detection in Asset Degradation Process Using Variational Autoencoder and Explanations.

Sensors (Basel). 2021 Dec 31;22(1):291. doi: 10.3390/s22010291.

Boosting phosphorylation site prediction with sequence feature-based machine learning.

Proteins. 2020 Feb;88(2):284-291. doi: 10.1002/prot.25801. Epub 2019 Aug 22.

Cited by

Biophysics-based protein language models for protein engineering.

Nat Methods. 2025 Sep 11. doi: 10.1038/s41592-025-02776-2.

An iterative deep learning-guided algorithm for directed protein evolution.

iScience. 2025 Aug 7;28(9):113324. doi: 10.1016/j.isci.2025.113324. eCollection 2025 Sep 19.

Symmetry, gauge freedoms, and the interpretability of sequence-function relationships.

Phys Rev Res. 2025 Apr-Jun;7(2). doi: 10.1103/physrevresearch.7.023005. Epub 2025 Apr 2.

A combinatorial mutational map of active non-native protein kinases by deep learning guided sequence design.

bioRxiv. 2025 Aug 3:2025.08.03.668353. doi: 10.1101/2025.08.03.668353.

PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context.

Nat Commun. 2025 Aug 5;16(1):7189. doi: 10.1038/s41467-025-62318-4.

Considering Metabolic Context in Enzyme Evolution and Design.

Biochemistry. 2025 Aug 19;64(16):3495-3507. doi: 10.1021/acs.biochem.5c00165. Epub 2025 Aug 5.

A Cyanobacterial Screening Platform for Rubisco Mutant Variants.

ACS Synth Biol. 2025 Jul 18;14(7):2619-2633. doi: 10.1021/acssynbio.5c00065. Epub 2025 Jul 7.

A generalized platform for artificial intelligence-powered autonomous enzyme engineering.

Nat Commun. 2025 Jul 1;16(1):5648. doi: 10.1038/s41467-025-61209-y.

Evolutionary-scale enzymology enables exploration of a rugged catalytic landscape.

Science. 2025 Jun 12;388(6752):eadu1058. doi: 10.1126/science.adu1058.

VenusMutHub-A benchmark for protein mutation effect prediction.

Acta Pharm Sin B. 2025 May;15(5):2805-2807. doi: 10.1016/j.apsb.2025.05.001. Epub 2025 May 14.

References

Neural networks to learn protein sequence-function relationships from deep mutational scanning data.

Proc Natl Acad Sci U S A. 2021 Nov 30;118(48). doi: 10.1073/pnas.2104878118.

Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions.

Nat Commun. 2021 Sep 1;12(1):5225. doi: 10.1038/s41467-021-25371-3.

Low-N protein engineering with data-efficient deep learning.

Nat Methods. 2021 Apr;18(4):389-396. doi: 10.1038/s41592-021-01100-y. Epub 2021 Apr 7.

An evolution-based model for designing chorismate mutase enzymes.

Science. 2020 Jul 24;369(6502):440-445. doi: 10.1126/science.aba3304.

TLmutation: Predicting the Effects of Mutations Using Transfer Learning.

J Phys Chem B. 2020 May 14;124(19):3845-3854. doi: 10.1021/acs.jpcb.0c00197. Epub 2020 May 1.

Unified rational protein engineering with sequence-based deep representation learning.

Nat Methods. 2019 Dec;16(12):1315-1322. doi: 10.1038/s41592-019-0598-1. Epub 2019 Oct 21.

Machine-learning-guided directed evolution for protein engineering.

Nat Methods. 2019 Aug;16(8):687-694. doi: 10.1038/s41592-019-0496-6. Epub 2019 Jul 15.

Deep generative models of genetic variation capture the effects of mutations.

Nat Methods. 2018 Oct;15(10):816-822. doi: 10.1038/s41592-018-0138-4. Epub 2018 Sep 24.

Inferring the shape of global epistasis.

Proc Natl Acad Sci U S A. 2018 Aug 7;115(32):E7550-E7558. doi: 10.1073/pnas.1804015115. Epub 2018 Jul 23.

Quantitative Missense Variant Effect Prediction Using Large-Scale Mutagenesis Data.

Cell Syst. 2018 Jan 24;6(1):116-124.e3. doi: 10.1016/j.cels.2017.11.003. Epub 2017 Dec 6.
