On the Analysis of Protein Genetic Architecture: Response to "Protein sequence landscapes are not so simple".

Authors

Park Yeonwoo, Metzger Brian P H, Thornton Joseph W

Affiliations

Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA.

Current affiliation: Center for RNA Research, Institute for Basic Science, Seoul, Republic of Korea.

Publication information

bioRxiv. 2024 Dec 6:2024.09.17.613512. doi: 10.1101/2024.09.17.613512.

DOI:10.1101/2024.09.17.613512

PMID:39677708

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11642768/

Abstract

We recently reanalyzed 20 combinatorial mutagenesis datasets using a novel reference-free analysis (RFA) method and showed that high-order epistasis contributes negligibly to protein sequence-function relationships in every case. Dupic, Phillips, and Desai (DPD) commented on a preprint of our work. In our published paper, we addressed all the major issues they raised, but we respond directly to them here. 1) DPD's claim that RFA is equivalent to estimating reference-based analysis (RBA) models by regression neglects fundamental differences in how the two formalisms dissect the causal relationship between sequence and function. It also misinterprets the observation that using regression to estimate any truncated model of genetic architecture will always yield the same predicted phenotypes and variance partition; the resulting estimates correspond to those of the RFA formalism but are inaccurate representations of the true RBA model. 2) DPD's claim that high-order epistasis is widespread and significant while somehow explaining little phenotypic variance is an artifact of two strong biases in the use of regression to estimate RBA models: this procedure underestimates the phenotypic variance explained by RBA epistatic terms while at the same time inflating the magnitude of individual terms. 3) DPD erroneously claim that RFA is "exactly equivalent" to Fourier analysis (FA) and background-averaged analysis (BA). This error arises because DPD used an incorrect mathematical definition of RFA and were misled by a simple numerical relationship among the models that holds only for the simplest kinds of datasets. 4) DPD argue that using a nonlinear transformation to account for global nonlinearities in sequence-function relationships is often unnecessary and may artifactually absorb specific epistatic interactions. We show that nonspecific epistasis caused by a limited dynamic range affects datasets of all types, even when the phenotype is represented on a free-energy scale. Moreover, using a nonlinear transformation in a joint fitting procedure does not underestimate specific epistasis under realistic conditions, even if the data are not affected by nonspecific epistasis. The conclusions of our work therefore hold: the genetic architecture of all 20 protein datasets we analyzed can be efficiently and accurately described in an RFA framework by first-order amino acid effects and pairwise interactions with a simple model of global nonlinearity. We are grateful for DPD's commentary, which helped us improve our paper.
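The contrast between the reference-based and reference-free formalisms in point 1 can be illustrated with a minimal sketch. It is not the authors' code. It assumes the usual textbook reading of the two formalisms: RBA terms are measured relative to a single reference genotype, and RFA terms are measured relative to the mean phenotype over all genotypes. The two-site, two-state landscape and its phenotype values are hypothetical, and the function names `rfa_first` and `rfa_pair` are made up for this illustration.

```python
from itertools import product
from statistics import mean

# Hypothetical complete landscape: two sites, two states each.
phenotype = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 5.0}
genotypes = list(product((0, 1), repeat=2))

# Reference-based analysis (assumed form): every term is a deviation
# from one chosen reference genotype, here (0, 0).
ref = (0, 0)
rba_intercept = phenotype[ref]
rba_first = [phenotype[(1, 0)] - rba_intercept,
             phenotype[(0, 1)] - rba_intercept]
rba_pair = phenotype[(1, 1)] - rba_intercept - sum(rba_first)

# Reference-free analysis (assumed form): every term is a deviation
# from the global mean phenotype, averaged over all backgrounds.
rfa_zero = mean(phenotype.values())

def rfa_first(site, state):
    """Mean phenotype of genotypes carrying `state` at `site`, minus the global mean."""
    return mean(phenotype[g] for g in genotypes if g[site] == state) - rfa_zero

def rfa_pair(s1, s2):
    """Phenotype of the pair minus the global mean and both first-order effects."""
    return phenotype[(s1, s2)] - rfa_zero - rfa_first(0, s1) - rfa_first(1, s2)

# Variance partition under the reference-free terms.
total_var = mean((phenotype[g] - rfa_zero) ** 2 for g in genotypes)
first_var = sum(mean(rfa_first(site, g[site]) ** 2 for g in genotypes)
                for site in (0, 1))
pair_var = mean(rfa_pair(*g) ** 2 for g in genotypes)

print("RBA pairwise term:", rba_pair)
print("RFA pairwise term for (1, 1):", rfa_pair(1, 1))
print("Fraction of variance from first-order terms:", first_var / total_var)
```

On this toy landscape the same data yield a pairwise term of 2.0 relative to the reference genotype but only 0.5 relative to the global mean, and the first-order reference-free terms account for 3.25 of the 3.5 units of total variance. This shows only that the size of an interaction term depends on the reference point; it does not reproduce the regression biases or the 20-dataset results described in the abstract.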
