Department of Clinical Genetics, Amsterdam Neuroscience, Vrije Universiteit Amsterdam, Amsterdam UMC, Amsterdam, The Netherlands.
Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands.
Transl Psychiatry. 2020 Nov 2;10(1):369. doi: 10.1038/s41398-020-01060-5.
The human genome harbors numerous structural variants (SVs) which, due to their repetitive nature, are currently underexplored in short-read whole-genome sequencing approaches. Using single-molecule, real-time (SMRT) long-read sequencing technology in combination with FALCON-Unzip, we generated a de novo assembly of the diploid genome of a cognitively healthy 115-year-old Dutch woman. We combined this assembly with two previously published haploid assemblies (CHM1 and CHM13) and the GRCh38 reference genome to create a compendium of SVs that occur across five independent human haplotypes, using the graph-based multi-genome aligner REVEAL. Across these five haplotypes, we detected 31,680 euchromatic SVs (>50 bp). Of these, ~62% consisted of repetitive sequences with 'variable number tandem repeats' (VNTRs) and ~10% were mobile elements (Alu, L1, and SVA), while the remaining variants were inversions and indels. We observed that VNTRs with GC-content >60% and repeat patterns longer than 15 bp were 21-fold enriched in the subtelomeric regions (within 5 Mb of the ends of chromosome arms). VNTRs can expand beyond a critical length, which is associated with impaired gene transcription. The genes that contained the most VNTRs, of which PTPRN2 and DLGAP2 are the most prominent examples, were found to be predominantly expressed in the brain and associated with a wide variety of neurological disorders. Repeat-induced variation represents a sizeable fraction of the genetic variation in human genomes and should be included in investigations of genetic factors associated with phenotypic traits, specifically those associated with neurological disorders. We make available the long- and short-read sequence data of the supercentenarian genome, and a compendium of SVs as identified across five human haplotypes.
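The VNTR criteria stated above (GC-content >60%, repeat pattern longer than 15 bp, location within 5 Mb of a chromosome end) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The `Vntr` record, the function names, the toy chromosome length, and the example motifs are all hypothetical, and two simplifying assumptions are made: GC-content is measured on the repeat motif, and "subtelomeric" is approximated as within 5 Mb of either end of the whole chromosome.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract.
GC_THRESHOLD = 0.60          # GC-content must exceed 60%
MIN_MOTIF_BP = 15            # repeat pattern must be longer than 15 bp
SUBTELOMERIC_BP = 5_000_000  # within 5 Mb of a chromosome end


@dataclass
class Vntr:
    """Hypothetical record for one VNTR call (not a format from the paper)."""
    chrom: str
    start: int   # 0-based start coordinate
    end: int     # end coordinate (exclusive)
    motif: str   # repeat pattern


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq; 0.0 for an empty string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def is_gc_rich_long_motif(v: Vntr) -> bool:
    """Assumption: GC-content is measured on the repeat motif itself."""
    return gc_fraction(v.motif) > GC_THRESHOLD and len(v.motif) > MIN_MOTIF_BP


def is_subtelomeric(v: Vntr, chrom_length: int) -> bool:
    """Assumption: 'subtelomeric' means within 5 Mb of either chromosome end."""
    return v.start < SUBTELOMERIC_BP or v.end > chrom_length - SUBTELOMERIC_BP


if __name__ == "__main__":
    toy_length = 100_000_000  # made-up chromosome length, for illustration only
    calls = [
        Vntr("chrA", 1_200_000, 1_203_000, "GGCGCCGGGCCCGCGGAT"),
        Vntr("chrA", 50_000_000, 50_001_000, "ATATATATATATATATAT"),
    ]
    for v in calls:
        print(v.chrom, v.start, is_gc_rich_long_motif(v), is_subtelomeric(v, toy_length))
```

An enrichment estimate such as the reported 21-fold figure would additionally require comparing the density of such VNTRs inside and outside the subtelomeric windows, which this sketch does not attempt.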