Investigating the determinants of performance in machine learning for protein fitness prediction.

Authors

Sandhu Mahakaran, Mater Adam C, Matthews Dana S, Spence Matthew A, Lenskiy Artem A, Jackson Colin

Affiliations

Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.

ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.

Publication information

Protein Sci. 2025 Aug;34(8):e70235. doi: 10.1002/pro.70235.

DOI:10.1002/pro.70235
PMID:40689706
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12278695/
Abstract

Machine learning (ML) has revolutionized protein biology, solving long-standing problems in protein folding, scaffold generation, and function design tasks. A range of architectures have shown success on supervised protein fitness prediction tasks. Nevertheless, in the absence of rational approaches for evaluating which architectures are optimal for specific datasets and engineering tasks, architecture choice remains challenging. Here, we propose a framework for investigating the determinants of success for a range of ML architectures. Using simulated (the NK model) and empirical fitness landscapes, we measure sequence-fitness prediction along six key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to increasing epistasis/ruggedness, ability to perform positional extrapolation, robustness to sparse training data, and sensitivity to sequence length. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness emerges as a primary determinant of the accuracy of sequence-fitness prediction. Our methodology and results provide a rational strategy for experimental data sampling, model selection, and evaluation rooted in fitness landscape theory, one that we hope will advance sequence-fitness prediction accuracy, with implications for protein engineering and variant functional prediction.
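
The abstract names the NK model as its simulated landscape and ruggedness as a main determinant of accuracy. The authors' implementation is not reproduced on this page, so the following is only a minimal sketch of a generic NK landscape, assuming a two-letter alphabet, randomly chosen interaction partners, and uniform random site contributions. The names `make_nk_landscape` and `count_local_maxima` are hypothetical, and counting local maxima is used here as one simple proxy for ruggedness, not necessarily the measure used in the paper.

```python
import itertools
import random


def make_nk_landscape(n, k, alphabet="AB", seed=0):
    """Build a random NK landscape; return a function mapping a sequence to fitness."""
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    rng = random.Random(seed)
    # Assumption: each site i is coupled to k other sites chosen at random.
    neighbours = [
        sorted(rng.sample([j for j in range(n) if j != i], k)) for i in range(n)
    ]
    # One uniform random contribution per site and per joint state of
    # (the site itself, its k interaction partners).
    tables = [
        {state: rng.random() for state in itertools.product(alphabet, repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(seq):
        if len(seq) != n:
            raise ValueError("sequence length must equal n")
        total = 0.0
        for i in range(n):
            state = (seq[i],) + tuple(seq[j] for j in neighbours[i])
            total += tables[i][state]
        return total / n  # mean site contribution, so fitness lies in [0, 1)

    return fitness


def count_local_maxima(fitness, n, alphabet="AB"):
    """Count sequences strictly fitter than every single-mutant neighbour."""
    count = 0
    for seq in itertools.product(alphabet, repeat=n):
        f = fitness(seq)
        is_peak = True
        for i in range(n):
            for letter in alphabet:
                if letter == seq[i]:
                    continue
                mutant = seq[:i] + (letter,) + seq[i + 1:]
                if fitness(mutant) >= f:
                    is_peak = False
                    break
            if not is_peak:
                break
        if is_peak:
            count += 1
    return count


if __name__ == "__main__":
    n = 8
    for k in (0, 2, 4, 7):
        landscape = make_nk_landscape(n, k, seed=42)
        print(f"K={k}: {count_local_maxima(landscape, n)} local maxima")
```

With K = 0 every site contributes independently, so the landscape is additive and has a single peak. Raising K couples sites together (epistasis), which typically produces more local maxima and a landscape that is harder for a model to interpolate or extrapolate.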

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bd3/12278695/d157041ba502/PRO-34-e70235-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bd3/12278695/a471a75b2327/PRO-34-e70235-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bd3/12278695/872a14f3803a/PRO-34-e70235-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bd3/12278695/508a9e25335d/PRO-34-e70235-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8bd3/12278695/3d430eaef87f/PRO-34-e70235-g004.jpg
