Kim Jina, Sung Joohon, Han Kyudong, Lee Wooseok, Mun Seyoung, Lee Jooyeon, Bahk Kunhyung, Yang Inchul, Bae Young-Kyung, Kim Changhoon, Kim Jong-Il, Seo Jeong-Sun
Interdisciplinary Program of Bioinformatics, College of Natural Science, Seoul National University, Seoul 08826, Korea.
Genome & Health Big Data Laboratory, Department of Health Science, Seoul National University, Seoul 08826, Korea.
Genes (Basel). 2020 Nov 13;11(11):1350. doi: 10.3390/genes11111350.
The current human reference genome (GRCh38), with its high quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent the genomes of particular ethnic groups, especially Asians, and exactly what is missing remains unclear. Here, we compared GRCh38 with a high-contiguity genome assembly of one Korean individual (AK1) and show that part of the AK1 genome is missing from GRCh38 and that the missing regions harbor 1390 putative coding elements. Furthermore, by analyzing the reads of fourteen individuals (five East Asians, four Europeans, and five Africans) that could not be mapped to GRCh38, we found that multiple populations share parts of the missing genome, amounting to ~5.3 Mb (0.2% of AK1) of genomic regions in total. The AK1 regions recovered from these unmapped reads, which are the estimated regions absent from GRCh38, harbor candidate coding elements. We verified that most of the common missing regions (shared by ≥7 individuals) exist in human and chimpanzee DNA. We further identified the mechanism of occurrence and the ethnic heterogeneity of the common missing regions, as well as their presence. This study illustrates a potential advantage of using a pangenome reference and highlights the need for further investigation of the features of regions globally missing from GRCh38.