Conformal prediction under feedback covariate shift for biomolecular design.

Affiliations

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.

Department of Statistics, University of California, Berkeley, CA 94720.

Publication information

Proc Natl Acad Sci U S A. 2022 Oct 25;119(43):e2204569119. doi: 10.1073/pnas.2204569119. Epub 2022 Oct 18.

DOI:10.1073/pnas.2204569119
PMID:36256807
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9618043/
Abstract

Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting: one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data (that is, the designed sequences) has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.
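
For readers unfamiliar with conformal prediction, the following is a minimal sketch of the standard split conformal procedure for regression. It is not the authors' method: the abstract does not spell out their construction, and this baseline assumes the calibration and test data are exchangeable, which is exactly the assumption that breaks when test inputs are chosen using the trained model. The function name `split_conformal_interval`, the synthetic data, and the least-squares model are all hypothetical choices made for illustration.

```python
import numpy as np


def split_conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1):
    """Return (lower, upper) arrays of interval bounds for each row of X_test.

    Assumes calibration and test points are exchangeable (no distribution shift).
    """
    # Absolute residuals on held-out calibration data are the conformity scores.
    scores = np.abs(y_cal - predict(X_cal))
    n = len(scores)
    # Finite-sample-corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        q = np.inf  # too few calibration points for the requested level
    else:
        q = np.sort(scores)[k - 1]
    center = predict(X_test)
    return center - q, center + q


# Toy usage on synthetic data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=600)

X_train, y_train = X[:200], y[:200]
X_cal, y_cal = X[200:400], y[200:400]
X_test, y_test = X[400:], y[400:]

# Any regression model can be plugged in; ordinary least squares is used here.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def predict(A):
    return A @ w_hat


lower, upper = split_conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1)
# Fraction of test labels that fall inside their intervals.
coverage = np.mean((y_test >= lower) & (y_test <= upper))
```

Under exchangeability, intervals built this way contain the true label with probability at least 1 - alpha on average over the calibration and test draws. In the design setting described above, the test inputs depend on the training data, so this baseline guarantee no longer applies, which is the gap the paper's confidence sets are meant to close.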
