Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA.
Genome Biol Evol. 2024 May 2;16(5). doi: 10.1093/gbe/evae093.
A fundamental goal in evolutionary biology and population genetics is to understand how selection shapes the fate of new mutations. Here, we test the null hypothesis that insertion-deletion (indel) events in protein-coding regions occur randomly with respect to secondary structures. We identified indels across 11,444 sequence alignments in mouse, rat, human, chimp, and dog genomes and then quantified their overlap with four different types of secondary structure (alpha helices, beta strands, protein bends, and protein turns) predicted by deep-learning methods of AlphaFold2. Indels overlapped secondary structures 54% as much as expected and were especially underrepresented over beta strands, which tend to form internal, stable regions of proteins. In contrast, indels were enriched by 155% over regions without any predicted secondary structures. These skews were stronger in the rodent lineages compared to the primate lineages, consistent with population genetic theory predicting that natural selection will be more efficient in species with larger effective population sizes. Nonsynonymous substitutions were also less common in regions of protein secondary structure, although not as strongly reduced as in indels. In a complementary analysis of thousands of human genomes, we showed that indels overlapping secondary structure segregated at significantly lower frequency than indels outside of secondary structure. Taken together, our study shows that indels are selected against if they overlap secondary structure, presumably because they disrupt the tertiary structure and function of a protein.
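To make the "as much as expected" comparison concrete, the following is a minimal sketch of an observed-to-expected overlap ratio under the null hypothesis of random placement. It is not the authors' pipeline. The function name `overlap_ratio`, the DSSP-style one-letter labels, and the simplification of each indel to a single residue position are all assumptions made for illustration.

```python
# Illustrative sketch only: names and label codes are assumptions, not the
# method used in the study.
# DSSP-style codes assumed here: H = alpha helix, E = beta strand,
# S = bend, T = turn, "-" = no predicted secondary structure.
STRUCTURED = set("HEST")


def overlap_ratio(ss_labels, indel_positions):
    """Return observed / expected overlap of indels with structured residues.

    ss_labels: string with one secondary-structure label per residue.
    indel_positions: 0-based residue indices, one per indel (each indel is
        simplified to a single position).
    """
    if not ss_labels or not indel_positions:
        raise ValueError("need at least one residue and one indel")

    # Expected under random placement: the fraction of residues that are
    # structured.
    expected = sum(label in STRUCTURED for label in ss_labels) / len(ss_labels)
    if expected == 0:
        raise ValueError("no structured residues, ratio is undefined")

    # Observed: the fraction of indels that fall on a structured residue.
    hits = sum(ss_labels[pos] in STRUCTURED for pos in indel_positions)
    observed = hits / len(indel_positions)

    return observed / expected


if __name__ == "__main__":
    # Toy protein of 20 residues, 12 of them structured.
    toy_labels = "HHHHHH--EEEE--TT----"
    toy_indels = [2, 6, 7, 13, 17]  # one indel in a helix, four outside
    print(round(overlap_ratio(toy_labels, toy_indels), 3))
```

A ratio below 1 means indels overlap structured residues less often than random placement would predict, and a ratio above 1 means they overlap more often. A fuller analysis would account for indel length and treat each structure type separately.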