Plender Elizabeth G, Prodanov Timofey, Hsieh PingHsun, Nizamis Evangelos, Harvey William T, Sulovari Arvis, Munson Katherine M, Kaufman Eli J, O'Neal Wanda K, Valdmanis Paul N, Marschall Tobias, Bloom Jesse D, Eichler Evan E
Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA.
Basic Sciences Division and Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA.
bioRxiv. 2024 Mar 20:2024.03.18.585560. doi: 10.1101/2024.03.18.585560.
The secreted mucins MUC5AC and MUC5B play critical defensive roles in airway pathogen entrapment and mucociliary clearance by encoding large glycoproteins with variable number tandem repeats (VNTRs). These polymorphic and degenerate protein-coding VNTRs make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5AC is largely invariant (5761-5762 aa); however, seven haplotypes have expanded VNTRs (6291-7019 aa). In contrast, 30 allelic variants of MUC5B encode 16 distinct proteins (5249-6325 aa) with cysteine-rich domain and VNTR copy number variation. We grouped MUC5B alleles into three phylogenetic clades: H1 (46%, ~5654 aa), H2 (33%, ~5742 aa), and H3 (7%, ~6325 aa). The two most common human MUC5B variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium (LD) and Tajima's D analyses reveal that East Asians carry exceptionally large MUC5B LD blocks with an excess of rare variation (p<0.05). To validate this result, we used Locityper to genotype MUC5B haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observed signatures of positive selection in H1 and H2 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Africans and Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium, consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein-coding VNTRs for improved disease associations.
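For readers unfamiliar with the Tajima's D statistic mentioned in the abstract, the following is a minimal sketch of the standard calculation, not the authors' pipeline. It assumes haplotypes are given as equal-length strings of '0' (reference allele) and '1' (alternate allele) at biallelic sites; the function name `tajimas_d` and the toy inputs are hypothetical. A negative value indicates an excess of rare variants and a positive value an excess of intermediate-frequency variants.

```python
import math


def tajimas_d(haplotypes):
    """Return Tajima's D for a list of equal-length '0'/'1' haplotype strings.

    Returns None when there are no segregating sites (D is undefined).
    Input encoding is an assumption made for this illustration.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must have equal length")

    # Count segregating sites (S) and accumulate mean pairwise differences (pi).
    seg_sites = 0
    pi = 0.0
    for site in range(length):
        k = sum(1 for h in haplotypes if h[site] == "1")
        if 0 < k < n:
            seg_sites += 1
            # Expected number of differences at this site per pair of haplotypes.
            pi += 2.0 * k * (n - k) / (n * (n - 1))

    if seg_sites == 0:
        return None

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    # D compares pi with Watterson's estimator S / a1, scaled by its standard deviation.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy data: every variant is a singleton (rare), so D is negative.
rare = ["1000", "0100", "0010", "0001"]
# Toy data: every variant is at 50% frequency, so D is positive.
common = ["11", "11", "00", "00"]
print(tajimas_d(rare), tajimas_d(common))
```

In practice such statistics are computed per window from phased variant calls with established population genetics software; the sketch only illustrates why an excess of rare variation pushes D below zero and an excess of common variation pushes it above zero.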