Comprehensive Analysis of the GXXXG Motif Reveals Structural Context-Dependent Diversity and Composition Across Proteins.

Author information

Lo Chi-Jen, Lin Ting-Fong, Juang Yue-Li, Chen Yi-Cheng

Affiliations

Metabolomics Core Laboratory, Healthy Aging Research Center, Chang Gung University, Taoyuan 333, Taiwan.

Institute of Biomedical Sciences, MacKay Medical University, New Taipei City 250, Taiwan.

Publication information

Int J Mol Sci. 2025 Sep 16;26(18):9014. doi: 10.3390/ijms26189014.

DOI:10.3390/ijms26189014

PMID:41009580

Abstract

The GXXXG motif, also called the glycine zipper, is a common sequence pattern that facilitates tight packing of secondary structures, especially through helix-helix interactions in both membrane and soluble proteins. However, its overall distribution, sequence variation, and structural preferences depending on context are not fully understood. Here, we offer a detailed, large-scale analysis of GXXXG motifs, examining over 25,000 unique UniProt sequences with structural data. We classified the motifs as transmembrane (TM), non-transmembrane (non-TM), or shared, based on their TM coverage, and analyzed them via statistical models, diversity measures, and compositional profiling. Our findings show that ≥60% TM coverage is a reliable cutoff to distinguish TM-specific motifs, which tend to have less sequence diversity, lower entropy, more hydrophobic residues (notably leucine, isoleucine, and valine), and rank-frequency distributions that follow a heavy-tailed pattern, indicating strong selective pressure. Conversely, non-TM motifs are more varied, with higher entropy and a preference for polar or flexible residues. Shared motifs have intermediate features, reflecting their functional versatility. Power-law and Zipfian analyses support the distinct statistical signatures of TM and non-TM motifs at the 60% coverage threshold. These results enhance our understanding of the structural and evolutionary roles of the GXXXG motif, setting clear standards for identifying TM-specific motifs and offering insights into membrane protein biology, synthetic design, and functional annotation.
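The classification and diversity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that "TM coverage" means the share of a motif's occurrences that fall inside annotated transmembrane segments, and that a motif with no TM occurrences counts as non-TM; the abstract states only the ≥60% cutoff, so the other two class boundaries are placeholders. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Lookahead pattern so that overlapping motifs are all reported.
MOTIF = re.compile(r"(?=(G[A-Z]{3}G))")


def find_motifs(seq):
    """Return (start, motif) for every GXXXG occurrence, overlaps included."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(seq)]


def in_tm(start, tm_segments, length=5):
    """True if the whole motif lies inside one TM segment.

    tm_segments holds 0-based, end-exclusive (begin, end) pairs.
    """
    return any(b <= start and start + length <= e for b, e in tm_segments)


def classify(tm_count, total, threshold=0.60):
    """Label a motif by the share of its occurrences found in TM segments.

    Only the 60% cutoff comes from the abstract. Treating zero coverage as
    non-TM and everything in between as shared is an assumption.
    """
    coverage = tm_count / total
    if coverage >= threshold:
        return "TM"
    if coverage == 0:
        return "non-TM"
    return "shared"


def shannon_entropy(items):
    """Shannon entropy in bits of the frequency distribution of items."""
    counts = Counter(items)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

For example, `find_motifs("MGAVLGAAAGK")` reports the two overlapping motifs `GAVLG` and `GAAAG`, and `shannon_entropy` applied to the motif strings of one class gives a single diversity value that can be compared across the TM, non-TM, and shared groups.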
