Dishuck Philip C, Munson Katherine M, Lewis Alexandra P, Dougherty Max L, Underwood Jason G, Harvey William T, Hsieh PingHsun, Pastinen Tomi, Eichler Evan E
Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
Present address: Tisch Cancer Institute, Division of Hematology and Medical Oncology, The Icahn School of Medicine at Mount Sinai, New York, NY, USA.
bioRxiv. 2025 Feb 5:2025.02.04.636496. doi: 10.1101/2025.02.04.636496.
The nuclear pore interacting protein gene family has expanded to high copy number in humans and African apes, where it has been subject to an excess of amino acid replacement consistent with positive selection (1). Due to the limitations of short-read sequencing, human genetic diversity in this gene family has been poorly understood. Using highly accurate assemblies generated from long-read sequencing as part of the human pangenome, we completely characterize 169 human haplotypes (4,665 paralogs and alleles). Of the 28 paralogs, just three (, , and ) are fixed at a single copy, and only a single locus, , shows no structural variation. Four paralogs map to large segmental duplication blocks that mediate polymorphic inversions (355 kbp to 1.6 Mbp) corresponding to microdeletions associated with developmental delay and autism. Haplotype-based tests of positive selection and selective sweeps identify two paralogs, and , within the top percentile for both tests. Using full-length cDNA data from 101 tissue/cell types, we construct paralog-specific gene models and show that 56% (31 of the 55 most abundant isoforms) have not been previously described in RefSeq. We define six distinct translation start sites and other protein structural features that distinguish paralogs, including a variable number tandem repeat that encodes a beta helix of variable size and that emerged ~3.1 million years ago in human evolution. Among the 28 paralogs, we identify distinct tissue and developmental patterns of expression, with only a few maintaining the ancestral testis-enriched expression. A subset of paralogs (, , , , and ) show increased brain expression. Our results suggest ongoing positive selection in the human population and rapid diversification of gene models.