Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
Center for Individualized Medicine Bioinformatics Program, Mayo Clinic, USA.
Brief Bioinform. 2018 Sep 28;19(5):893-904. doi: 10.1093/bib/bbx037.
Current variant discovery approaches often rely on an initial mapping of reads to the reference sequence. Their effectiveness is limited by gaps, potential misassemblies, duplicated regions with high sequence similarity and regions of high sequence divergence in the reference. Mapping-based approaches are also less sensitive to large INDELs and complex variations, and they provide little phase information in personal genomes. A few de novo assemblers have been developed to identify variants through direct variant calling from the assembly graph, micro-assembly or whole-genome assembly, but mainly for whole-genome sequencing (WGS) data. We developed SGVar, a de novo assembly workflow for haplotype-based variant discovery from whole-exome sequencing (WES) data. Using simulated human exome data, we compared SGVar with five variation-aware de novo assemblers and with BWA-MEM paired with three haplotype-based or local de novo assembly-based callers. SGVar outperformed the other assemblers in sensitivity and in tolerance of sequencing errors. We recapitulated these findings on whole-genome and exome data from a trio of Utah residents with Northern and Western European ancestry (CEU), showing that SGVar had high sensitivity both in the highly divergent human leukocyte antigen (HLA) region and in non-HLA regions of chromosome 6. In particular, SGVar is robust to sequencing error, k-mer selection, divergence level and coverage depth. Unlike mapping-based approaches, SGVar can resolve long-range phase and identify large INDELs from WES data, and even more so from WGS data. We conclude that SGVar represents an ideal platform for WES-based variant discovery in highly divergent regions and across the whole genome.