Department of Molecular and Human Genetics, Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
PLoS One. 2012;7(11):e47768. doi: 10.1371/journal.pone.0047768. Epub 2012 Nov 21.
Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole-genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and in the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors combine to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to "phase 3 finished" status using time-consuming and expensive Sanger-based manual finishing processes. To make the assembly and finishing of draft genomes easier, we present an automated finishing approach that uses long reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly (publicly available at https://sourceforge.net/projects/pb-jelly/), automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides "lift-over" coordinate tables so that existing annotations can be easily ported to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the sooty mangabey. With 24× mapped coverage of PacBio long reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long reads, reads addressed 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long reads, we addressed 97% of gaps, closed 66% of addressed gaps, and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and was shown to depend on initial reference quality.
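The percentages above use two different denominators: the D. pseudoobscura closure rate is given over all gaps, while the budgerigar and mangabey closure rates are given over addressed gaps only. As a minimal sketch of how the figures can be put on a common footing, the snippet below converts a rate over addressed gaps into a rate over all gaps. The function name `fraction_of_all_gaps` is hypothetical and is not part of PBJelly; the inputs are the percentages quoted in the abstract.

```python
def fraction_of_all_gaps(addressed: float, rate_among_addressed: float) -> float:
    """Hypothetical helper (not part of PBJelly).

    Converts a rate reported over addressed gaps into a rate over all gaps
    by multiplying by the fraction of gaps that were addressed.
    """
    if not (0.0 <= addressed <= 1.0 and 0.0 <= rate_among_addressed <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    return addressed * rate_among_addressed


# Figures quoted in the abstract, expressed as fractions.
mangabey_closed = fraction_of_all_gaps(0.97, 0.66)    # 97% addressed, 66% of those closed
budgerigar_closed = fraction_of_all_gaps(0.63, 0.32)  # 63% addressed, 32% of those closed

print(f"mangabey closed (all gaps):   {mangabey_closed:.1%}")
print(f"budgerigar closed (all gaps): {budgerigar_closed:.1%}")
```

Under this reading, roughly 64% of all mangabey gaps and roughly 20% of all budgerigar gaps were closed, which can be set beside the 69% of all gaps closed in D. pseudoobscura.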