Stadermann Kai Bernd, Weisshaar Bernd, Holtgräwe Daniela
Chair of Genome Research, Faculty of Biology, Bielefeld University, Bielefeld, Germany.
Bioinformatics Resource Facility, Centre for Biotechnology, Bielefeld University, Bielefeld, Germany.
BMC Bioinformatics. 2015 Sep 16;16(1):295. doi: 10.1186/s12859-015-0726-6.
Third-generation sequencing methods, such as SMRT (Single Molecule, Real-Time) sequencing developed by Pacific Biosciences, offer much longer read lengths than Next Generation Sequencing (NGS) methods. Hence, they are well suited for de novo sequencing and resequencing projects. Sequence data generated for these purposes contain not only reads originating from the nuclear genome but also a significant number of reads originating from the organelles of the target organism. These reads are usually discarded, but they can also be used to assemble organellar replicons. The long read length helps to resolve repetitive regions within organellar genomes, which can be problematic when only short-read data are used. Additionally, SMRT sequencing is less affected by GC-rich regions and by long stretches of the same base.
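To illustrate why read length matters for repeats, the following minimal sketch checks whether a single aligned read bridges a repeat completely, with some unique sequence anchored on both sides. The function name `spans_repeat`, the coordinate convention (0-based, half-open) and the default `min_anchor` of 500 bases are assumptions made for this illustration and are not taken from the described workflow.

```python
def spans_repeat(read_start: int, read_end: int,
                 repeat_start: int, repeat_end: int,
                 min_anchor: int = 500) -> bool:
    """Return True if an aligned read covers the whole repeat and extends
    at least `min_anchor` bases into the flanking sequence on both sides.

    All coordinates are 0-based, half-open positions on the same sequence.
    `min_anchor` is a hypothetical threshold chosen for illustration.
    """
    if read_end <= read_start or repeat_end <= repeat_start:
        raise ValueError("intervals must have positive length")
    left_anchored = read_start <= repeat_start - min_anchor
    right_anchored = read_end >= repeat_end + min_anchor
    return left_anchored and right_anchored


# Illustrative coordinates only: a 25 kb repeat at positions 2,000-27,000.
long_read_bridges = spans_repeat(0, 30_000, 2_000, 27_000)
short_read_bridges = spans_repeat(0, 250, 2_000, 27_000)
```

A read that passes this check links the unique sequence on both sides of the repeat, which is the kind of evidence a short read cannot provide on its own.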
We describe a workflow for de novo assembly of the sugar beet (Beta vulgaris ssp. vulgaris) chloroplast genome sequence based solely on data from a SMRT sequencing dataset that targeted the nuclear genome. We show that the data obtained from such an experiment are sufficient to create a high-quality assembly that is more reliable than assemblies derived from, for example, Illumina reads alone. The chloroplast genome is especially challenging to assemble de novo because it contains two large inverted repeat (IR) regions. We also describe some limitations that still apply even when long reads are used for the assembly.
SMRT sequencing reads extracted from a dataset created for nuclear genome (re)sequencing can be used to obtain a high-quality de novo assembly of the chloroplast genome of the sequenced organism. Even with relatively low overall coverage of the nuclear genome, it is possible to collect more than enough reads to generate a high-quality assembly that outperforms short-read-based assemblies. However, even with long reads it is not always possible to determine the order of the elements of a chloroplast genome sequence reliably, as we demonstrated with Fosmid End Sequences (FES) generated with Sanger technology. This limitation also applies to short-read sequencing data, where it is reached at a much earlier stage of finishing.
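The coverage argument can be made concrete with a short calculation. The sketch below estimates the expected coverage of a replicon from the total number of sequenced bases and the fraction of those bases that originate from it. The function name `expected_coverage` and every number in the example (total bases, chloroplast fraction, genome sizes) are hypothetical values chosen for illustration, not results from this study.

```python
def expected_coverage(total_bases: float,
                      fraction_from_replicon: float,
                      replicon_length: int) -> float:
    """Expected average coverage of one replicon.

    total_bases            -- all bases sequenced in the dataset
    fraction_from_replicon -- share of those bases derived from the replicon (0..1)
    replicon_length        -- length of the replicon in bases
    """
    if replicon_length <= 0:
        raise ValueError("replicon_length must be positive")
    if not 0.0 <= fraction_from_replicon <= 1.0:
        raise ValueError("fraction_from_replicon must be between 0 and 1")
    return total_bases * fraction_from_replicon / replicon_length


# Hypothetical dataset: 5 Gb of reads, 5 % of bases from the chloroplast.
TOTAL_BASES = 5e9
CHLOROPLAST_FRACTION = 0.05       # assumption for illustration
CHLOROPLAST_LENGTH = 150_000      # assumed replicon size in bases
NUCLEAR_LENGTH = 750_000_000      # assumed nuclear genome size in bases

chloroplast_cov = expected_coverage(TOTAL_BASES, CHLOROPLAST_FRACTION,
                                    CHLOROPLAST_LENGTH)
nuclear_cov = expected_coverage(TOTAL_BASES, 1.0 - CHLOROPLAST_FRACTION,
                                NUCLEAR_LENGTH)

print(f"chloroplast: {chloroplast_cov:.0f}x, nuclear: {nuclear_cov:.1f}x")
```

Under these assumed values the small replicon receives coverage several orders of magnitude higher than the nuclear genome, which is why a dataset with modest nuclear coverage can still supply ample chloroplast reads.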