Warren René L, Yang Chen, Vandervalk Benjamin P, Behsaz Bahar, Lagman Albert, Jones Steven J M, Birol Inanç
BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada.
Gigascience. 2015 Aug 4;4:35. doi: 10.1186/s13742-015-0076-3. eCollection 2015.
Owing to the complexity of the assembly problem, most genome sequences remain incomplete. Assembling reads into finished genomes is made harder by sequence repeats and by the inability of short reads to capture enough genomic information to resolve those problematic regions. Established and emerging long-read technologies show great promise here, but their currently higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before the reads can be of value.
We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that uses the sequence properties of nanopore sequence data and other error-containing sequence data to scaffold high-quality genome assemblies without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased more than five-fold using beta-released Oxford Nanopore Technologies Ltd. long reads, and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose contiguity and correctness are on par with or better than those of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes.
This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.
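The alignment-free idea named in the abstract (k-mers taken at a long interval from error-containing reads, used to connect high-quality contigs) can be illustrated with a minimal sketch. It is not the LINKS implementation: the function names, the parameters `k`, `distance` and `step`, and the toy sequences are all assumptions made for illustration, and the sketch ignores reverse complements, contig orientation and gap-size estimation. A k-mer pair that contains a sequencing error simply fails the exact lookup, which is why no base correction is needed in this scheme.

```python
from collections import Counter


def index_contig_kmers(contigs, k):
    """Map each k-mer to its contig, keeping only k-mers that occur exactly once."""
    index = {}
    repeated = set()
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in index or kmer in repeated:
                # Seen before: ambiguous, so exclude it from the index.
                repeated.add(kmer)
                index.pop(kmer, None)
            else:
                index[kmer] = name
    return index


def paired_kmers(read, k, distance, step):
    """Yield pairs of k-mers whose start positions in the read are `distance` apart."""
    for i in range(0, len(read) - distance - k + 1, step):
        yield read[i:i + k], read[i + distance:i + distance + k]


def count_links(contigs, reads, k, distance, step=1):
    """Count k-mer pairs whose two halves match, exactly, two different contigs."""
    index = index_contig_kmers(contigs, k)
    links = Counter()
    for read in reads:
        for first, second in paired_kmers(read, k, distance, step):
            a, b = index.get(first), index.get(second)
            if a is not None and b is not None and a != b:
                links[(a, b)] += 1
    return links


# Toy data (hypothetical): one long read spanning two contigs and a short gap.
contigs = {"c1": "ACGTTGCAAG", "c2": "GGATCCTTAC"}
reads = ["ACGTTGCAAG" + "TTTTT" + "GGATCCTTAC"]
links = count_links(contigs, reads, k=4, distance=12)
print(links)
```

In this toy case every supporting pair links `c1` to `c2`; a scaffolder would join contigs only when the number of supporting pairs passes some threshold, which is again an assumption of the sketch rather than a detail stated in the abstract.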