Investigating the determinants of performance in machine learning for protein fitness prediction.

Author information

Sandhu Mahakaran, Mater Adam C, Matthews Dana S, Spence Matthew A, Lenskiy Artem A, Jackson Colin

Affiliations

Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.

ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.

Publication information

Protein Sci. 2025 Aug;34(8):e70235. doi: 10.1002/pro.70235.

DOI:10.1002/pro.70235

PMID:40689706

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12278695/

Abstract

Machine learning (ML) has revolutionized protein biology, solving long-standing problems in protein folding, scaffold generation, and function design tasks. A range of architectures have shown success on supervised protein fitness prediction tasks. Nevertheless, in the absence of rational approaches for evaluating which architectures are optimal for specific datasets and engineering tasks, architecture choice remains challenging. Here, we propose a framework for investigating the determinants of success for a range of ML architectures. Using simulated (the NK model) and empirical fitness landscapes, we measure sequence-fitness prediction along six key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to increasing epistasis/ruggedness, ability to perform positional extrapolation, robustness to sparse training data, and sensitivity to sequence length. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness emerges as a primary determinant of the accuracy of sequence-fitness prediction. Our methodology and results provide a rational strategy for experimental data sampling, model selection, and evaluation rooted in fitness landscape theory, one that we hope will advance sequence-fitness prediction accuracy, with implications for protein engineering and variant functional prediction.
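For readers unfamiliar with the NK model mentioned in the abstract, the following is a minimal sketch of how such a simulated landscape can be built and how its ruggedness can be probed. It is not the authors' code: the function names `make_nk_landscape` and `count_local_maxima`, the two-letter alphabet, the random choice of interaction partners, and the use of local-maximum counts as a ruggedness proxy are all assumptions made for illustration. In the textbook NK formulation, each of N sites contributes a random value that depends on its own state and the states of K other sites, and fitness is the mean of those contributions.

```python
import itertools
import random


def make_nk_landscape(n, k, alphabet="AB", seed=0):
    """Build a random NK landscape and return its fitness function.

    Illustrative assumptions: interaction partners are chosen at random,
    and contributions are drawn uniformly from [0, 1).
    """
    rng = random.Random(seed)
    # Site i depends on itself plus k other randomly chosen sites.
    neighbours = [
        [i] + rng.sample([j for j in range(n) if j != i], k) for i in range(n)
    ]
    # One lookup table per site: a random contribution for every local state.
    tables = [
        {state: rng.random() for state in itertools.product(alphabet, repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(seq):
        # Fitness is the mean of the per-site contributions.
        return sum(
            tables[i][tuple(seq[j] for j in neighbours[i])] for i in range(n)
        ) / n

    return fitness


def count_local_maxima(fitness, n, alphabet="AB"):
    """Count sequences with no strictly fitter single-site mutant."""
    count = 0
    for seq in itertools.product(alphabet, repeat=n):
        f = fitness(seq)
        mutants = (
            seq[:i] + (a,) + seq[i + 1:]
            for i in range(n)
            for a in alphabet
            if a != seq[i]
        )
        if all(fitness(m) <= f for m in mutants):
            count += 1
    return count


if __name__ == "__main__":
    n = 6
    for k in (0, 2, 5):
        landscape = make_nk_landscape(n, k, seed=1)
        print(f"K={k}: {count_local_maxima(landscape, n)} local maxima")
```

With K = 0 the landscape is purely additive, so it has a single peak; raising K introduces epistasis and typically increases the number of local maxima, which is one simple way to think about the ruggedness the abstract identifies as a primary determinant of prediction accuracy.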
