Diabetes Molecular Genetics Section, Phoenix Epidemiology and Clinical Research Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Phoenix, AZ 85004, USA.
Genome Biol Evol. 2024 Sep 3;16(9). doi: 10.1093/gbe/evae188.
There is a collective push to diversify human genetic studies by including underrepresented populations. However, analyzing DNA sequence reads begins with aligning the reads to the GRCh38/hg38 reference genome, which is inadequate for non-European ancestries. In this study, using long-read sequencing technology, we constructed de novo genome assemblies from two indigenous Americans from Arizona (IAZ). Each assembly included ∼17 Mb of DNA sequence not present in hg38 [nonreference sequence (NRS)], consisting mostly of repeat elements. Forty NRSs totaling 240 kb were uniquely anchored to the hg38 primary assembly, generating a modified hg38-NRS reference genome. DNA sequence alignment and variant calling were then conducted with whole-genome sequencing (WGS) data from 387 IAZ using both the hg38 and modified hg38-NRS reference maps. Variant calling with the hg38-NRS map identified ∼50,000 single-nucleotide variants, each present in at least 5% of the WGS samples, that were not detected with the hg38 reference map. We also directly assessed the NRSs positioned within genes. Seventeen NRSs anchored to genic regions, including an identical 187 bp NRS found in both de novo assemblies. This NRS is located in HCN2, 79 bp downstream of Exon 3, and contains several putative transcriptional regulatory elements. Genotyping of the HCN2-NRS revealed that the insertion is enriched in IAZ (minor allele frequency = 0.45) compared to other reference populations tested. This study shows that inclusion of population-specific NRSs can dramatically change the variant profile in an underrepresented ethnic group and thereby lead to the discovery of previously missed common variations.
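The two population-level quantities quoted in the abstract, the fraction of samples carrying a variant (the 5% threshold) and the minor allele frequency, can be illustrated with a minimal sketch. This is not the study's pipeline. It assumes diploid genotypes encoded as alternate-allele counts (0, 1, or 2, with None for a missing call), the function names are hypothetical, and the genotype list is made-up toy data rather than data from the 387 IAZ samples.

```python
from typing import Iterable, List, Optional


def _called(genotypes: Iterable[Optional[int]]) -> List[int]:
    """Drop missing calls (None); raise if nothing is left to count."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return called


def carrier_fraction(genotypes: Iterable[Optional[int]]) -> float:
    """Fraction of called samples carrying at least one alternate allele."""
    called = _called(genotypes)
    return sum(1 for g in called if g > 0) / len(called)


def minor_allele_frequency(genotypes: Iterable[Optional[int]]) -> float:
    """Frequency of the rarer allele among called diploid samples."""
    called = _called(genotypes)
    alt_freq = sum(called) / (2 * len(called))  # two alleles per sample
    return min(alt_freq, 1.0 - alt_freq)


# Toy data (hypothetical): 6 homozygous reference, 3 heterozygous,
# 1 homozygous alternate, and 1 missing call.
toy_genotypes = [0] * 6 + [1] * 3 + [2] * 1 + [None]

fraction = carrier_fraction(toy_genotypes)
maf = minor_allele_frequency(toy_genotypes)
passes_threshold = fraction >= 0.05  # "present in at least 5% of samples"
```

For the toy list, 4 of the 10 called samples carry the alternate allele and 5 of the 20 called alleles are alternate, so the variant would pass a 5% carrier threshold.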