Greenman Kevin P, Amini Ava P, Yang Kevin K
Department of Chemical Engineering, Catholic Institute of Technology, Cambridge, Massachusetts, United States of America.
Department of Chemistry, Catholic Institute of Technology, Cambridge, Massachusetts, United States of America.
PLoS Comput Biol. 2025 Jan 7;21(1):e1012639. doi: 10.1371/journal.pcbi.1012639. eCollection 2025 Jan.
Machine learning sequence-function models for proteins could enable significant advances in protein engineering, especially when paired with state-of-the-art methods to select new sequences for property optimization and/or model improvement. Such methods (Bayesian optimization and active learning) require calibrated estimations of model uncertainty. While studies have benchmarked a variety of deep learning uncertainty quantification (UQ) methods on standard and molecular machine-learning datasets, it is not clear if these results extend to protein datasets. In this work, we implemented a panel of deep learning UQ methods on regression tasks from the Fitness Landscape Inference for Proteins (FLIP) benchmark. We compared results across different degrees of distributional shift using metrics that assess each UQ method's accuracy, calibration, coverage, width, and rank correlation. Additionally, we compared these metrics using one-hot encoding and pretrained language model representations, and we tested the UQ methods in retrospective active learning and Bayesian optimization settings. Our results indicate that there is no single best UQ method across all datasets, splits, and metrics, and that uncertainty-based sampling is often unable to outperform greedy sampling in Bayesian optimization. These benchmarks enable us to provide recommendations for more effective design of biological sequences using machine learning.
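As an illustration of the kinds of metrics named above, the following is a minimal sketch, not the authors' implementation. It assumes a model that returns a predictive mean and standard deviation per sequence. It computes accuracy (RMSE), coverage and average width of a symmetric ±z·σ interval, and a rank correlation between predicted uncertainty and absolute error. The function names, the choice z = 2, and the un-normalized width are all assumptions made for this example. Calibration error is not covered here.

```python
import numpy as np


def _ranks(x):
    # Ordinal ranks 0..n-1. Assumption: ties are broken by position,
    # not averaged, so this matches Spearman's rho only without ties.
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks


def uq_metrics(y_true, y_pred, y_std, z=2.0):
    """Hypothetical helper: summary metrics for mean/std predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_std = np.asarray(y_std, dtype=float)

    abs_err = np.abs(y_true - y_pred)

    # Accuracy: root-mean-square error of the mean prediction.
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

    # Coverage: fraction of true values inside mean +/- z * std.
    coverage = float(np.mean(abs_err <= z * y_std))

    # Width: average full width of that interval (not normalized).
    mean_width = float(np.mean(2.0 * z * y_std))

    # Rank correlation between predicted uncertainty and absolute error.
    rank_corr = float(np.corrcoef(_ranks(y_std), _ranks(abs_err))[0, 1])

    return {
        "rmse": rmse,
        "coverage": coverage,
        "mean_width": mean_width,
        "rank_corr": rank_corr,
    }


if __name__ == "__main__":
    # Toy numbers for illustration only; not data from the study.
    metrics = uq_metrics(
        y_true=[0.0, 1.0, 2.0, 3.0],
        y_pred=[0.1, 1.5, 2.0, 5.0],
        y_std=[0.1, 0.2, 0.05, 0.5],
    )
    print(metrics)
```

A well-behaved UQ method would ideally show coverage close to the nominal level of the interval, a small width, and a positive rank correlation, so these quantities are usually read together rather than in isolation.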