Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning.

Affiliations

Department of Statistics, The Pennsylvania State University, State College, PA 16802, USA; Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA.

Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA.

Publication information

Cell Syst. 2021 Jan 20;12(1):92-101.e8. doi: 10.1016/j.cels.2020.10.007. Epub 2020 Nov 18.

Abstract

Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.
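The core idea of learning from positive and unlabeled sequences can be illustrated with a small sketch. This is not the method developed in the paper; it swaps in a generic Elkan-Noto-style correction for illustration. A classifier is trained to separate labeled positives from unlabeled sequences, and its output is then rescaled by an estimated labeling frequency `c`. The one-hot encoding, the plain gradient-descent logistic regression, and all function names (`one_hot`, `fit_logistic`, `pu_fit`, `pu_predict`) are assumptions made for this example.

```python
import numpy as np

def one_hot(seqs, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Encode equal-length sequences as flat one-hot vectors (illustrative encoding)."""
    idx = {a: i for i, a in enumerate(alphabet)}
    k = len(alphabet)
    X = np.zeros((len(seqs), len(seqs[0]) * k))
    for n, seq in enumerate(seqs):
        for pos, aa in enumerate(seq):
            X[n, pos * k + idx[aa]] = 1.0
    return X

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, s, lr=0.5, steps=2000, l2=1e-3):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = _sigmoid(Xb @ w)
        w -= lr * (Xb.T @ (p - s) / len(s) + l2 * w)
    return w

def predict_labeled(X, w):
    """Estimated probability that a sequence is in the labeled (positive) set."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return _sigmoid(Xb @ w)

def pu_fit(X_pos, X_unl, holdout_frac=0.2, seed=0):
    """Fit labeled-vs-unlabeled classifier; estimate c from held-out positives."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X_pos))
    n_hold = max(1, int(holdout_frac * len(X_pos)))
    hold, train = perm[:n_hold], perm[n_hold:]
    X = np.vstack([X_pos[train], X_unl])
    s = np.concatenate([np.ones(len(train)), np.zeros(len(X_unl))])
    w = fit_logistic(X, s)
    # c approximates P(labeled | truly positive) under the
    # "selected completely at random" assumption.
    c = predict_labeled(X_pos[hold], w).mean()
    return w, c

def pu_predict(X, w, c):
    """Rescaled score approximating P(functional | sequence), clipped to [0, 1]."""
    return np.clip(predict_labeled(X, w) / c, 0.0, 1.0)

if __name__ == "__main__":
    # Toy data (hypothetical): a sequence is "functional" if it starts with 'A'.
    rng = np.random.default_rng(1)
    letters = list("ACDE")
    unlabeled = ["".join(rng.choice(letters, size=4)) for _ in range(400)]
    positives = ["A" + "".join(rng.choice(letters, size=3)) for _ in range(200)]
    w, c = pu_fit(one_hot(positives, "ACDE"), one_hot(unlabeled, "ACDE"))
    print(pu_predict(one_hot(["ACDE", "CCDE"], "ACDE"), w, c))
```

In this toy setup the fitted weights on the position-1 features play the role that the abstract assigns to estimated parameters: large weights flag residues associated with function. Real DMS data would need a treatment of sampling error and high-dimensional correlated features that this sketch does not attempt.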
