Miller Brian J
Biologic Institute, Redmond, WA, United States of America.
PLoS One. 2024 Dec 5;19(12):e0314929. doi: 10.1371/journal.pone.0314929. eCollection 2024.
A key question in protein evolution and protein engineering is the prevalence of evolutionary paths between distinct proteins. An evolutionary path corresponds to a continuous path of functional sequences in sequence space leading from one protein to another. Natural selection could direct a mutating coding region in DNA along a continuous functional path (CFP), so a new protein could arise far more easily than if a coding region were randomly mutating without any constraints. The distribution and length of CFPs undergird theories on the origin of natural proteins and strategies for engineering artificial proteins. This study examined the distribution of long CFPs within the framework of percolation theory, which addresses the proportion of randomly filled sites in a lattice above which long continuous paths of neighboring filled sites become common (the percolation threshold). It also used a simulation to demonstrate that the percolation threshold in protein sequence space approximates the reciprocal of the average number of protein variants that could result from a single mutation. For diverse proteins, the ratio was calculated between the percolation threshold and the reported proportion of sequences of a protein's length that perform that protein's function. This ratio measures the biasing in the distribution of functional sequences required for evolutionary paths to exist, so it provides a means to quantify the specificity in protein sequence and structure required for a protein to develop new catalytic functions. The consistently high ratio demonstrates that CFPs can only connect distinct proteins if the biasing in the distribution of functional sequences in sequence space is often extremely large. Regions in sequence space are identified where the biasing is sufficient to allow for extensive CFPs.
The calculated levels of required biasing and the identified regions of high biasing reinforce the conclusion of previous studies that some proteins are highly optimized, so mutations can enable or enhance catalytic functions while maintaining the protein's structure. The conclusions of this study also challenge the results of a previous application of percolation theory to sequence space that did not properly incorporate the percolation threshold. Steps are outlined for integrating the percolation threshold and the biasing measure into studies of protein sequence space.
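The two quantities defined in the abstract, the percolation threshold and the biasing ratio, can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the study: each sequence is assumed to have `length * variants_per_site` single-mutation neighbors (19 alternative amino acids per position), and the functional fraction passed in the usage example is a hypothetical placeholder, not a reported value. The function names are likewise invented for illustration. The ratio is computed in log10 units because functional fractions for realistic protein lengths can be too small to handle comfortably as ordinary floats.

```python
import math


def percolation_threshold(length, variants_per_site=19):
    """Approximate the percolation threshold as the reciprocal of the
    number of variants reachable by a single mutation.

    Assumption (not from the source): every position can change to any of
    `variants_per_site` other residues, giving length * variants_per_site
    neighbors per sequence.
    """
    neighbors = length * variants_per_site
    return 1.0 / neighbors


def log10_biasing_ratio(length, log10_functional_fraction, variants_per_site=19):
    """Return log10(threshold / functional fraction).

    `log10_functional_fraction` is the base-10 logarithm of the proportion
    of all sequences of this length that perform the protein's function.
    """
    threshold = percolation_threshold(length, variants_per_site)
    return math.log10(threshold) - log10_functional_fraction


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the arithmetic.
    example_length = 100
    example_log10_fraction = -20.0
    ratio = log10_biasing_ratio(example_length, example_log10_fraction)
    print(f"threshold = {percolation_threshold(example_length):.3e}")
    print(f"log10(biasing ratio) = {ratio:.2f}")
```

If mutations are counted at the nucleotide level, a single mutation reaches fewer protein variants than this per-site count, so `variants_per_site` would need to be lowered accordingly; the sketch leaves that choice to the caller.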