Machine learning to predict continuous protein properties from binary cell sorting data and map unseen sequence space.

Affiliations

Chemical Engineering, University of Michigan, Ann Arbor, MI 48109.

Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109.

Publication information

Proc Natl Acad Sci U S A. 2024 Mar 12;121(11):e2311726121. doi: 10.1073/pnas.2311726121. Epub 2024 Mar 7.

DOI:10.1073/pnas.2311726121

PMID:38451939

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10945751/

Abstract

Proteins are a diverse class of biomolecules responsible for wide-ranging cellular functions, from catalyzing reactions to recognizing pathogens. The ability to evolve proteins rapidly and inexpensively toward improved properties is a common objective for protein engineers. Powerful high-throughput methods like fluorescence-activated cell sorting and next-generation sequencing have dramatically improved directed evolution experiments. However, it is unclear how to best leverage these data to characterize protein fitness landscapes more completely and identify lead candidates. In this work, we develop a simple yet powerful framework to improve protein optimization by predicting continuous protein properties from simple directed evolution experiments using interpretable, linear machine learning models. Importantly, we find that these models, which use data from simple but imprecise experimental estimates of protein fitness, have predictive capabilities that approach more precise but expensive data. Evaluated across five diverse protein engineering tasks, continuous properties are consistently predicted from readily available deep sequencing data, demonstrating that protein fitness space can be reasonably well modeled by linear relationships among sequence mutations. To prospectively test the utility of this approach, we generated a library of stapled peptides and applied the framework to predict affinity and specificity from simple cell sorting data. We then coupled integer linear programming, a method to optimize protein fitness from linear weights, with mutation scores from machine learning to identify variants in unseen sequence space that have improved and co-optimal properties. This approach represents a versatile tool for improved analysis and identification of protein variants across many domains of protein engineering.
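The two ideas in the abstract, a linear model trained on binary sort labels and an optimization over its per-mutation weights, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not specify the model, so a plain logistic regression stands in for the "interpretable, linear" model; the data are synthetic; the four-letter alphabet, sequence length, and function names are hypothetical; and an exhaustive search with a mutation budget stands in for an integer linear programming solver.

```python
import itertools
import math
import random

ALPHABET = "ACDE"  # toy residue alphabet (assumption, kept small for speed)


def one_hot(seq):
    """Flatten a sequence into 0/1 features, one per (position, residue) pair."""
    return [1.0 if aa == res else 0.0 for aa in seq for res in ALPHABET]


def additive(seq, w):
    """Sum the per-residue weights selected by a sequence."""
    return sum(w[i * len(ALPHABET) + ALPHABET.index(aa)] for i, aa in enumerate(seq))


def fit_logistic(seqs, labels, epochs=200, lr=0.5, l2=1e-3):
    """Fit a linear classifier to binary sort labels by batch gradient descent."""
    X = [one_hot(s) for s in seqs]
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - t
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def best_variant(wild_type, w, max_mutations):
    """Maximize the additive score with at most max_mutations substitutions.

    Exhaustive search stands in for an integer linear programming solver here.
    """
    best = wild_type
    for k in range(1, max_mutations + 1):
        for positions in itertools.combinations(range(len(wild_type)), k):
            options = [[r for r in ALPHABET if r != wild_type[p]] for p in positions]
            for choice in itertools.product(*options):
                seq = list(wild_type)
                for p, r in zip(positions, choice):
                    seq[p] = r
                seq = "".join(seq)
                if additive(seq, w) > additive(best, w):
                    best = seq
    return best


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


# Simulated experiment (all values synthetic): a hidden additive property,
# observed only as a binary "sorted into the positive gate" label.
rng = random.Random(0)
LENGTH = 6
true_w = [rng.gauss(0, 1) for _ in range(LENGTH * len(ALPHABET))]
seqs = ["".join(rng.choice(ALPHABET) for _ in range(LENGTH)) for _ in range(500)]
prop = [additive(s, true_w) + rng.gauss(0, 0.3) for s in seqs]
gate = sorted(prop)[len(prop) // 2]  # median split mimics a sort gate
labels = [1.0 if p > gate else 0.0 for p in prop]

train, test = slice(0, 400), slice(400, 500)
w, b = fit_logistic(seqs[train], labels[train])
# The pre-sigmoid linear score serves as the continuous prediction.
scores = [b + additive(s, w) for s in seqs[test]]
r = pearson(scores, prop[test])
wild_type = "A" * LENGTH
candidate = best_variant(wild_type, w, max_mutations=2)
print(round(r, 2), candidate)
```

The script prints the held-out correlation between the linear score and the simulated continuous property, followed by the highest-scoring variant within two substitutions of the wild type. For a realistic library, the search step would be handed to an actual integer programming solver, since enumeration grows combinatorially with sequence length and mutation budget.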
