On the sparsity of fitness functions and implications for learning.

Affiliations

Biophysics Graduate Group, University of California, Berkeley, CA 94720.

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.

Publication information

Proc Natl Acad Sci U S A. 2022 Jan 4;119(1). doi: 10.1073/pnas.2109649118.

Abstract

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model's interpretable parameters (sequence length, alphabet size, and assumed interactions between sequence positions) on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.
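To make the notion of sparsity in terms of epistatic interactions concrete, the following is a minimal sketch, not the authors' code. It samples an NK-style fitness function over binary sequences and counts its nonzero Walsh-Hadamard coefficients, each of which corresponds to one epistatic interaction among a subset of positions. The sequence length, neighborhood size, wrap-around "adjacent" neighborhoods, Gaussian local terms, and all variable names are assumptions chosen for illustration only.

```python
import itertools
import numpy as np

# Illustrative parameters (assumptions, not taken from the paper's experiments).
L = 8   # sequence length
K = 3   # neighborhood size, counting the position itself
rng = np.random.default_rng(0)

# Assumed "adjacent" neighborhoods with periodic wrap-around.
neighborhoods = [[(j + d) % L for d in range(K)] for j in range(L)]

# Enumerate all 2^L binary sequences; row i is the binary expansion of i.
seqs = np.array(list(itertools.product([0, 1], repeat=L)))

# NK-style fitness: a sum of L random local terms, one per neighborhood.
f = np.zeros(len(seqs))
for nb in neighborhoods:
    table = rng.normal(size=2 ** K)                  # one value per local subsequence
    local_index = seqs[:, nb] @ (2 ** np.arange(K))  # row of `table` for each sequence
    f += table[local_index]

# Walsh-Hadamard transform, applied along one sequence position (axis) at a time.
coef = f.reshape([2] * L)
for axis in range(L):
    lo, hi = np.take(coef, 0, axis=axis), np.take(coef, 1, axis=axis)
    coef = np.stack([lo + hi, lo - hi], axis=axis)
coef = coef / 2 ** L

# Each coefficient is indexed by a subset of positions (an epistatic interaction).
observed = {
    frozenset(int(p) for p in np.flatnonzero(u))
    for u in np.argwhere(np.abs(coef) > 1e-9)
}
# Subsets of positions that fit inside at least one neighborhood.
expected = {
    frozenset(c)
    for nb in neighborhoods
    for r in range(K + 1)
    for c in itertools.combinations(nb, r)
}

print(f"nonzero coefficients: {len(observed)} of {2 ** L}")
print("support equals subsets of neighborhoods:", observed == expected)
```

In this toy setting the nonzero coefficients are the interactions among positions that share a neighborhood, so the sparsity is set by the neighborhood structure and not by the size of the sequence space. Standard compressed sensing results make the number of measurements needed for exact recovery grow with this sparsity and only logarithmically with the number of possible sequences, which is the link the abstract draws between the GNK parameters and sample requirements.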
