Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology Graduate University, Onna, Okinawa, Japan.
PLoS Comput Biol. 2022 Apr 4;18(4):e1010016. doi: 10.1371/journal.pcbi.1010016. eCollection 2022 Apr.
Connecting protein sequence to function is becoming increasingly relevant as high-throughput sequencing studies accumulate large amounts of genomic data. To go beyond existing database annotation, it is fundamental to understand the mechanisms underlying functional inheritance and divergence. If the homology relationship between proteins is known, can we determine whether their function has diverged? In this work, we analyze the possible outcomes of protein sequence evolution after gene duplication and identify "inter-paralog inversions", i.e., sites where the relationship between ancestry and functional signal is decoupled. The amino acids at these sites are masked from recognition by other prediction tools. Still, they play a role in functional divergence and could indicate a shift in protein function. We develop a method to specifically recognize inter-paralog amino acid inversions in a phylogeny and test it on real and simulated datasets. In a dataset built from the Epidermal Growth Factor Receptor (EGFR) sequences found in 88 fish species, we identify 19 amino acid sites that underwent inversion after gene duplication, mostly located in the ligand-binding extracellular domain. Our work uncovers an outcome of protein duplications with direct implications for protein functional annotation and sequence evolution. The developed method is optimized to work with large protein datasets and can be readily included in a targeted protein analysis pipeline.
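To make the idea of an inter-paralog inversion concrete, the following is a minimal toy sketch, not the authors' method. It assumes one simple reading of the definition: at an inverted site, the residue conserved in paralog A of one clade is the residue conserved in paralog B of another clade, and vice versa. The function names (`consensus`, `find_inversions`), the clade and paralog labels, the 0.8 conservation threshold, and the example sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter

GAP_LIKE = {"-", "X"}  # characters ignored when computing a column consensus


def consensus(residues, min_fraction=0.8):
    """Return the dominant residue if it reaches min_fraction, else None."""
    residues = [r for r in residues if r not in GAP_LIKE]
    if not residues:
        return None
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(residues) >= min_fraction else None


def find_inversions(groups, min_fraction=0.8):
    """Return 0-based alignment columns showing a swapped residue pattern.

    groups maps (clade, paralog) to a list of aligned sequences of equal
    length, with clades "clade1"/"clade2" and paralogs "A"/"B"
    (hypothetical labels for this sketch).
    """
    keys = [("clade1", "A"), ("clade1", "B"), ("clade2", "A"), ("clade2", "B")]
    length = len(groups[keys[0]][0])
    hits = []
    for i in range(length):
        a1, b1, a2, b2 = (
            consensus([seq[i] for seq in groups[key]], min_fraction)
            for key in keys
        )
        if None in (a1, b1, a2, b2):
            continue  # column not conserved enough within some group
        # Paralogs differ within a clade, and the pattern is swapped
        # between the two clades.
        if a1 != b1 and a1 == b2 and b1 == a2:
            hits.append(i)
    return hits


# Toy alignment (made-up sequences): column 1 is K/R in clade1 but R/K in clade2.
groups = {
    ("clade1", "A"): ["MKLT", "MKLT", "MKLS"],
    ("clade1", "B"): ["MRLT", "MRLT", "MRLT"],
    ("clade2", "A"): ["MRLT", "MRLT"],
    ("clade2", "B"): ["MKLT", "MKLT"],
}
hits = find_inversions(groups)
print(hits)
```

A real analysis would work on a phylogeny rather than on two predefined clades, and would need a statistical treatment of conservation. This sketch only illustrates the swapped pattern that a per-paralog conservation score would miss.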