Lo Chi-Jen, Lin Ting-Fong, Juang Yue-Li, Chen Yi-Cheng
Metabolomics Core Laboratory, Healthy Aging Research Center, Chang Gung University, Taoyuan 333, Taiwan.
Institute of Biomedical Sciences, MacKay Medical University, New Taipei City 250, Taiwan.
Int J Mol Sci. 2025 Sep 16;26(18):9014. doi: 10.3390/ijms26189014.
The GXXXG motif, also called the glycine zipper, is a common sequence pattern that facilitates tight packing of secondary structures, especially through helix-helix interactions in both membrane and soluble proteins. However, its overall distribution, sequence variation, and structural preferences depending on context are not fully understood. Here, we offer a detailed, large-scale analysis of GXXXG motifs, examining over 25,000 unique UniProt sequences with structural data. We classified the motifs as transmembrane (TM), non-transmembrane (non-TM), or shared, based on their TM coverage, and analyzed them via statistical models, diversity measures, and compositional profiling. Our findings show that ≥60% TM coverage is a reliable cutoff to distinguish TM-specific motifs, which tend to have less sequence diversity, lower entropy, more hydrophobic residues (notably leucine, isoleucine, and valine), and rank-frequency distributions that follow a heavy-tailed pattern, indicating strong selective pressure. Conversely, non-TM motifs are more varied, with higher entropy and a preference for polar or flexible residues. Shared motifs have intermediate features, reflecting their functional versatility. Power-law and Zipfian analyses support the distinct statistical signatures of TM and non-TM motifs at the 60% coverage threshold. These results enhance our understanding of the structural and evolutionary roles of the GXXXG motif, setting clear standards for identifying TM-specific motifs and offering insights into membrane protein biology, synthetic design, and functional annotation.
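As a rough illustration of the kind of analysis the abstract describes, the following Python sketch locates GXXXG occurrences in a sequence, computes the fraction of each occurrence that lies inside annotated transmembrane segments, and computes a Shannon entropy over motif counts. It is not the authors' pipeline. The function names, the 0-based half-open segment format, and the per-occurrence reading of "TM coverage" are all assumptions made for this example; only the 60% cutoff comes from the abstract.

```python
import math
import re
from collections import Counter

# Lookahead so that overlapping motifs (e.g. GAAAGAAAG) are all reported.
# Assumes upper-case one-letter amino acid codes.
GXXXG = re.compile(r"(?=(G[A-Z]{3}G))")

# Cutoff quoted in the abstract; applying it per occurrence is an assumption.
TM_CUTOFF = 0.60


def find_gxxxg(seq):
    """Return (start, motif) for every GXXXG occurrence, with 0-based start."""
    return [(m.start(), m.group(1)) for m in GXXXG.finditer(seq)]


def tm_coverage(start, tm_segments, length=5):
    """Fraction of the motif's residues that fall inside any TM segment.

    tm_segments is a list of 0-based, half-open (begin, end) intervals.
    This input format is hypothetical; real annotations need converting.
    """
    covered = sum(
        any(begin <= pos < end for begin, end in tm_segments)
        for pos in range(start, start + length)
    )
    return covered / length


def shannon_entropy(counts):
    """Shannon entropy in bits of a mapping from item to count."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    seq = "MKGLLLGAVVGSSSGKK"   # toy sequence, not real data
    tm_segments = [(1, 9)]      # toy TM annotation, not real data
    tm_motifs = Counter()
    for start, motif in find_gxxxg(seq):
        cov = tm_coverage(start, tm_segments)
        label = "TM" if cov >= TM_CUTOFF else "non-TM"
        print(start, motif, cov, label)
        if label == "TM":
            tm_motifs[motif] += 1
    print("entropy of TM motifs (bits):", shannon_entropy(tm_motifs))
```

Applied to a real dataset, the motif counts would be pooled across all sequences before computing entropy or a rank-frequency distribution, and a motif observed in both TM and non-TM contexts would fall into a third, shared class.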