Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
The McDonnell Genome Institute at Washington University, Washington University School of Medicine, St. Louis, MO, USA.
Nat Methods. 2019 Jan;16(1):88-94. doi: 10.1038/s41592-018-0236-3. Epub 2018 Dec 17.
We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8% sequence identity) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
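The graph construction described above can be illustrated with a small toy sketch. This is not SDA's implementation: the input format (each read as a mapping from PSV position to whether the read carries the paralog-specific allele), the function names, the +1/-1 edge scoring, and the use of connected components over net-attractive edges as the partitioning step are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def build_psv_graph(reads):
    """Score PSV pairs: +1 (attraction) when a read carries both PSVs,
    -1 (repulsion) when a read spans both sites but carries only one.
    Hypothetical scoring, chosen for illustration."""
    weights = defaultdict(int)
    for calls in reads.values():
        for a, b in combinations(sorted(calls), 2):
            if calls[a] and calls[b]:
                weights[(a, b)] += 1
            elif calls[a] != calls[b]:
                weights[(a, b)] -= 1
    return weights


def cluster_psvs(weights):
    """Group PSVs joined by net-attractive edges (union-find).
    A simple stand-in for a real graph-partitioning step."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        root_a, root_b = find(a), find(b)
        if w > 0:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(g) for g in groups.values())


def assign_reads(reads, clusters):
    """Assign each read to the PSV cluster it shares the most PSVs with."""
    partition = defaultdict(list)
    for read_id, calls in reads.items():
        carried = {pos for pos, present in calls.items() if present}
        overlaps = [len(carried & set(c)) for c in clusters]
        best = max(range(len(clusters)), key=overlaps.__getitem__)
        if overlaps[best] > 0:
            partition[best].append(read_id)
    return dict(partition)


# Toy data: PSVs at 100 and 200 mark one paralog, 300 and 400 another.
reads = {
    "r1": {100: True, 200: True, 300: False, 400: False},
    "r2": {100: True, 200: True, 300: False},
    "r3": {100: False, 200: False, 300: True, 400: True},
    "r4": {200: False, 300: True, 400: True},
}
clusters = cluster_psvs(build_psv_graph(reads))
print(clusters)
print(assign_reads(reads, clusters))
```

In this toy input, reads r1 and r2 group with the PSVs at 100 and 200, and reads r3 and r4 group with those at 300 and 400; each read group would then be assembled separately to reconstruct one paralog.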