Bickhart Derek M, Rosen Benjamin D, Koren Sergey, Sayre Brian L, Hastie Alex R, Chan Saki, Lee Joyce, Lam Ernest T, Liachko Ivan, Sullivan Shawn T, Burton Joshua N, Huson Heather J, Nystrom John C, Kelley Christy M, Hutchison Jana L, Zhou Yang, Sun Jiajie, Crisà Alessandra, Ponce de León F Abel, Schwartz John C, Hammond John A, Waldbieser Geoffrey C, Schroeder Steven G, Liu George E, Dunham Maitreya J, Shendure Jay, Sonstegard Tad S, Phillippy Adam M, Van Tassell Curtis P, Smith Timothy P L
Cell Wall Biology and Utilization Laboratory, ARS USDA, Madison, Wisconsin, USA.
Animal Genomics and Improvement Laboratory, ARS USDA, Beltsville, Maryland, USA.
Nat Genet. 2017 Apr;49(4):643-650. doi: 10.1038/ng.3802. Epub 2017 Mar 6.
The decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms have resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate the current state of the art for de novo assembly using the domestic goat (Capra hircus), based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced what is, to our knowledge, the most continuous de novo mammalian assembly to date, with chromosome-length scaffolds and only 649 gaps. Compared to the previously published C. hircus assembly, our assembly represents a ∼400-fold improvement in continuity due to properly assembled gaps and better resolves repetitive structures longer than 1 kb, representing the largest repeat family and immune gene complex yet produced for an individual of a ruminant species.