Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA.
Center for RNA Research, Institute for Basic Science, Seoul, Republic of Korea.
Nat Commun. 2024 Sep 11;15(1):7953. doi: 10.1038/s41467-024-51895-5.
How complex are the rules by which a protein's sequence determines its function? High-order epistatic interactions among residues are thought to be pervasive, suggesting an idiosyncratic and unpredictable sequence-function relationship. But many prior studies may have overestimated epistasis, either because they analyzed sequence-function relationships relative to a single reference sequence, which causes measurement noise and local idiosyncrasies to snowball into high-order epistasis, or because they did not fully account for global nonlinearities. Here we present a reference-free method that jointly infers specific epistatic interactions and global nonlinearity using a bird's-eye view of sequence space. This technique yields the simplest explanation of sequence-function relationships and is more robust than existing methods to measurement noise, missing data, and model misspecification. We reanalyze 20 experimental datasets and find that context-independent amino acid effects and pairwise interactions, along with a simple nonlinearity to account for limited dynamic range, explain a median of 96% of phenotypic variance and over 92% in every case. Only a tiny fraction of genotypes are strongly affected by higher-order epistasis. Sequence-function relationships are also sparse: a minuscule fraction of amino acids and interactions account for 90% of phenotypic variance. Sequence-function causality across these datasets is therefore simple, opening the way for tractable approaches to characterize proteins' genetic architecture.
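To make the reference-free idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes a complete combinatorial library with no missing genotypes and omits the global nonlinearity entirely. Under those assumptions, each effect is defined relative to the global mean phenotype instead of a single reference sequence. The toy landscape, the function names (`reference_free_terms`, `predict`, `variance_explained`), and all numeric values are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np


def reference_free_terms(genotypes, phenotypes):
    """Zero-, first-, and second-order terms measured from the global mean.

    Assumes a complete, balanced library (every combination of states present).
    """
    g = np.asarray(genotypes)
    y = np.asarray(phenotypes, dtype=float)
    n_sites = g.shape[1]
    e0 = y.mean()  # zero-order term: mean phenotype over all genotypes
    e1 = {}
    for i in range(n_sites):
        for a in np.unique(g[:, i]):
            # Mean of all genotypes carrying state a at site i, minus global mean
            e1[(i, int(a))] = y[g[:, i] == a].mean() - e0
    e2 = {}
    for i, j in itertools.combinations(range(n_sites), 2):
        for a in np.unique(g[:, i]):
            for b in np.unique(g[:, j]):
                mask = (g[:, i] == a) & (g[:, j] == b)
                # Pairwise term: what the lower-order terms leave unexplained
                e2[(i, int(a), j, int(b))] = (
                    y[mask].mean() - e0 - e1[(i, int(a))] - e1[(j, int(b))]
                )
    return e0, e1, e2


def predict(genotypes, e0, e1, e2, order=2):
    """Predict phenotypes from terms truncated at the given order (1 or 2)."""
    g = np.asarray(genotypes)
    n_sites = g.shape[1]
    pred = np.full(len(g), e0, dtype=float)
    for k, row in enumerate(g):
        pred[k] += sum(e1[(i, int(row[i]))] for i in range(n_sites))
        if order >= 2:
            pred[k] += sum(
                e2[(i, int(row[i]), j, int(row[j]))]
                for i, j in itertools.combinations(range(n_sites), 2)
            )
    return pred


def variance_explained(y, pred):
    """Fraction of phenotypic variance captured by the predictions."""
    y = np.asarray(y, dtype=float)
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


# Hypothetical toy landscape: 3 sites x 3 states, all 27 genotypes present.
genotypes = np.array(list(itertools.product(range(3), repeat=3)))
site_effects = np.array([[0.0, 1.0, -0.5], [0.0, 0.3, 0.8], [0.0, -1.2, 0.4]])
additive = site_effects[np.arange(3), genotypes].sum(axis=1)
pair = 1.0 * ((genotypes[:, 0] == 1) & (genotypes[:, 1] == 2))  # one pairwise interaction
triple = 0.5 * np.all(genotypes == 2, axis=1)                    # one third-order interaction
y = additive + pair + triple

e0, e1, e2 = reference_free_terms(genotypes, y)
r2_first = variance_explained(y, predict(genotypes, e0, e1, e2, order=1))
r2_second = variance_explained(y, predict(genotypes, e0, e1, e2, order=2))
print(f"first-order R^2: {r2_first:.3f}, up to second-order R^2: {r2_second:.3f}")
```

Because every term is an average over many genotypes, noise in any single measurement is diluted rather than propagated into higher-order terms, which is the intuition behind the robustness claim. Handling missing genotypes or a limited dynamic range would require fitting the terms by regression through a nonlinear link, which this sketch does not attempt.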