Steinberg Karyn Meltz, Schneider Valerie A, Graves-Lindsay Tina A, Fulton Robert S, Agarwala Richa, Huddleston John, Shiryev Sergey A, Morgulis Aleksandr, Surti Urvashi, Warren Wesley C, Church Deanna M, Eichler Evan E, Wilson Richard K
The Genome Institute at Washington University, St. Louis, Missouri 63108, USA;
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
Genome Res. 2014 Dec;24(12):2066-76. doi: 10.1101/gr.180893.114. Epub 2014 Nov 4.
A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. To overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA, including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.