Ogeh Denye, Badge Richard
Bioinformatics. 2017 Mar 1;33(5):650-653. doi: 10.1093/bioinformatics/btw687.
The advent of Next Generation Sequencing (NGS) has made it possible to generate enormous volumes of short-read sequence data cheaply and on reasonable time scales. However, genome assemblies built from NGS data are generally of lower quality than those built from Sanger DNA sequencing. This is largely because short reads cannot scaffold across repetitive structures, which creates gaps, inversions and rearrangements and results in assemblies that are, at best, drafts. Third-generation single-molecule sequencing (SMS) technologies (e.g. the Pacific Biosciences Single Molecule Real Time (SMRT) system) address this challenge by generating longer reads, offering the prospect of better recovery of these complex repetitive structures and, with it, improved assembly quality.
Here, we evaluate the ability of SMS data (specifically, human genome Pacific Biosciences SMRT data) to recover poorly represented repetitive sequences (specifically, GC-rich human minisatellites). To do this, we designed a pipeline for the collection, processing and local assembly of single-molecule sequence data to form accurate, contiguous local reconstructions. Our results show the recovery of an allele of the non-coding minisatellite MS1 (located on chromosome 1 at 1p33-35) at greater than 97% identity to the reference (GRCh38) from the unprocessed sequence data of a haploid complete hydatidiform mole (CHM1) cell line. Furthermore, our assembly revealed an allele of over 500 repeat units, much larger than the reference (GRCh38) allele but consistent in structure with naturally occurring alleles that are segregating in human populations. This local reconstruction was validated by the release of the whole-genome assemblies GCA_001297185.1 and GCA_000772585.3, in which this allele occurs. Additionally, applying this pipeline to coding minisatellites in the PRDM9 and ZNF93 genes enabled the recovery of high-identity allele structures for these regions, whose lengths were confirmed by PCR from cell line genomic DNA. The internal repeat structure of the recovered PRDM9 allele was consistent with common human-specific alleles.
Code available at https://github.com/ndliberial/smrt_pipeline.
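As an illustration of the "percent identity to reference" figure quoted above, the following is a minimal Python sketch, not the authors' pipeline code. It assumes identity is defined as the number of matching columns divided by the total number of columns in a global pairwise alignment; the abstract does not state which definition or aligner was used. The function names `global_align` and `percent_identity` and the scoring values are hypothetical choices made for this example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns the two aligned strings, padded with '-' for gaps.
    Scoring values are illustrative assumptions, not from the paper.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Toy sequences only; these are not MS1, PRDM9 or ZNF93 data.
    assembled, reference = global_align("AGT", "ACGT")
    print(assembled, reference, percent_identity(assembled, reference))
```

This dynamic-programming table uses time and memory proportional to the product of the two sequence lengths, so it is only suitable for short toy inputs. For a minisatellite allele of several hundred repeat units, a dedicated long-read aligner would be the practical choice.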