Park Yeonwoo, Metzger Brian P H, Thornton Joseph W
Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL 60637.
Current affiliation: Center for RNA Research, Institute for Basic Science, Seoul, Republic of Korea 08826.
bioRxiv. 2024 Feb 7:2023.09.02.556057. doi: 10.1101/2023.09.02.556057.
How complicated is the genetic architecture of proteins - the set of causal effects by which sequence determines function? High-order epistatic interactions among residues are thought to be pervasive, making a protein's function difficult to predict or understand from its sequence. Most studies, however, have used methods that overestimate epistasis, either because they analyze genetic architecture relative to a designated reference sequence - causing measurement noise and small local idiosyncrasies to propagate into pervasive high-order interactions - or because they have not effectively accounted for global nonlinearity in the sequence-function relationship. Here we present a new reference-free method that jointly estimates global nonlinearity and specific epistatic interactions across a protein's entire genotype-phenotype map. This method yields a maximally efficient explanation of a protein's genetic architecture and is more robust than existing methods to measurement noise, partial sampling, and model misspecification. We reanalyze 20 combinatorial mutagenesis experiments from a diverse set of proteins and find that additive and pairwise effects, along with a simple nonlinearity to account for limited dynamic range, explain a median of 96% of total variance in measured phenotypes (and >92% in every case). Only a tiny fraction of genotypes are strongly affected by third- or higher-order epistasis. Genetic architecture is also sparse: the number of terms required to explain the vast majority of variance is smaller than the number of genotypes by many orders of magnitude. The sequence-function relationship in most proteins is therefore far simpler than previously thought, opening the way for new and tractable approaches to characterize it.
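The idea of measuring effects against the average over all genotypes, rather than against a designated reference sequence, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a complete, noise-free map over binary sites, defines each effect as the mean phenotype of the genotypes sharing a state (or pair of states) minus the lower-order terms, and omits the global nonlinearity that the method estimates jointly. The function names and the toy phenotype map are hypothetical.

```python
import itertools
import numpy as np


def reference_free_effects(phenotypes, n_sites):
    """Estimate zero-, first-, and second-order effects from a complete
    binary genotype-phenotype map (dict: tuple of 0/1 -> phenotype)."""
    genotypes = np.array(list(itertools.product((0, 1), repeat=n_sites)))
    y = np.array([phenotypes[tuple(g)] for g in genotypes], dtype=float)

    # Zero-order term: mean phenotype over all genotypes (no reference sequence).
    e0 = y.mean()

    # First-order effect of state s at site i: mean of genotypes carrying
    # that state, minus the global mean.
    first = {}
    for i in range(n_sites):
        for s in (0, 1):
            first[(i, s)] = y[genotypes[:, i] == s].mean() - e0

    # Pairwise effect: mean of genotypes carrying both states, minus the
    # global mean and the two first-order effects.
    pair = {}
    for i, j in itertools.combinations(range(n_sites), 2):
        for s, t in itertools.product((0, 1), repeat=2):
            mask = (genotypes[:, i] == s) & (genotypes[:, j] == t)
            pair[(i, s, j, t)] = y[mask].mean() - e0 - first[(i, s)] - first[(j, t)]
    return e0, first, pair


def predict(genotype, e0, first, pair, order):
    """Predict a phenotype from effects up to the given order (1 or 2)."""
    value = e0 + sum(first[(i, s)] for i, s in enumerate(genotype))
    if order >= 2:
        sites = range(len(genotype))
        value += sum(pair[(i, genotype[i], j, genotype[j])]
                     for i, j in itertools.combinations(sites, 2))
    return value


def variance_explained(phenotypes, n_sites, order):
    """Fraction of phenotypic variance captured by the truncated model."""
    e0, first, pair = reference_free_effects(phenotypes, n_sites)
    keys = list(phenotypes)
    y = np.array([phenotypes[g] for g in keys], dtype=float)
    y_hat = np.array([predict(g, e0, first, pair, order) for g in keys])
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


# Hypothetical toy map: three sites, additive effects plus one pairwise
# interaction between sites 0 and 1, and no third-order term.
N_SITES = 3
toy_map = {
    g: 1.0 + 0.5 * g[0] - 0.3 * g[1] + 0.2 * g[2] + 0.4 * g[0] * g[1]
    for g in itertools.product((0, 1), repeat=N_SITES)
}

r2_first = variance_explained(toy_map, N_SITES, order=1)
r2_second = variance_explained(toy_map, N_SITES, order=2)
print(f"variance explained, first order: {r2_first:.3f}")
print(f"variance explained, up to second order: {r2_second:.3f}")
```

Because the toy map contains no third-order term, the second-order model reproduces it exactly, while the first-order model leaves the pairwise interaction unexplained. With real, noisy, or incompletely sampled data the effects would be estimated by regression rather than by direct averaging.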