Department of Bioengineering, Stanford University, Stanford, California 94305, USA;
Nevada Bioinformatics Center, University of Nevada, Reno, Nevada 89557, USA.
Genome Res. 2023 Oct;33(10):1734-1746. doi: 10.1101/gr.277175.122. Epub 2023 Oct 25.
Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity.
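The maximum likelihood idea described in the abstract can be illustrated with a toy sketch. This is not the ASLAN implementation: the data layout, the function names (`region_log_likelihood`, `localize`), the per-child error rate `eps`, and the uniform prior over the four parental haplotypes are all assumptions made here for illustration. The sketch assumes each family supplies, for every candidate region, which maternal and paternal haplotype each child inherited, plus a flag for whether the subsequence was seen in that child's unmapped reads.

```python
import math

def region_log_likelihood(inherit, present, eps=0.05):
    """Log-likelihood that a subsequence sits in one region, for one family.

    inherit: per-child (maternal_hap, paternal_hap) tuples, each 0 or 1
             (hypothetical encoding of the family's phasing in this region).
    present: per-child bool, True if the subsequence appears in that child's
             unmapped reads.
    eps:     assumed probability that an observation disagrees with inheritance.
    """
    hap_likelihoods = []
    for parent in (0, 1):      # 0 = mother, 1 = father
        for hap in (0, 1):     # which of that parent's two haplotypes carries it
            lik = 1.0
            for child_haps, seen in zip(inherit, present):
                expected = child_haps[parent] == hap
                lik *= (1 - eps) if expected == seen else eps
            hap_likelihoods.append(lik)
    # Marginalize over the four candidate haplotypes with a uniform prior.
    return math.log(sum(hap_likelihoods) / 4)

def localize(families, n_regions, eps=0.05):
    """Return (best_region, scores), summing log-likelihoods across families.

    families: list of (phasing, present), where phasing[r] is the `inherit`
              list for region r.
    """
    scores = [0.0] * n_regions
    for phasing, present in families:
        for r in range(n_regions):
            scores[r] += region_log_likelihood(phasing[r], present, eps)
    best = max(range(n_regions), key=scores.__getitem__)
    return best, scores

# Toy family: three children, three candidate regions (synthetic values).
phasing = [
    [(0, 0), (0, 1), (1, 0)],  # region 0
    [(0, 0), (1, 1), (1, 0)],  # region 1
    [(1, 1), (1, 1), (0, 0)],  # region 2
]
present = [True, False, False]  # only the first child carries the subsequence
best, scores = localize([(phasing, present)], n_regions=3)
print(best, scores)
```

In this toy family, only region 1 has a parental haplotype inherited by the first child alone, so it receives the highest score. With a single family, many regions would tie in practice; summing log-likelihoods over many families is what would narrow the candidate set in a sketch like this.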