Overlap graph-based generation of haplotigs for diploids and polyploids.

Affiliations

Centrum Wiskunde & Informatica, Amsterdam, The Netherlands.

Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, The Netherlands.

Publication information

Bioinformatics. 2019 Nov 1;35(21):4281-4289. doi: 10.1093/bioinformatics/btz255.

DOI:10.1093/bioinformatics/btz255

PMID:30994902

Abstract

MOTIVATION

Haplotype-aware genome assembly plays an important role in genetics, medicine and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference independent haplotig computation has not yet reached maturity.

RESULTS

We present POLYploid genome fitTEr (POLYTE) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes of known ploidy. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings.
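The iterative scheme described above can be illustrated with a toy sketch. This is not POLYTE's implementation: the abstract does not specify how overlaps are scored or which edges are accepted, so the sketch assumes exact suffix-prefix overlaps and merges two sequences only when the edge between them is unambiguous (a single successor and a single predecessor). The function names `suffix_prefix_overlap`, `merge_once` and `grow_contigs`, and the `min_len` parameter, are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    Requiring exact matches is a simplifying assumption: a read carrying
    one allele never overlaps a read carrying the other allele across
    the variant site.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge_once(seqs, min_len):
    """One iteration: build the overlap graph, then join unambiguous pairs."""
    succ = {i: [] for i in range(len(seqs))}
    pred = {i: [] for i in range(len(seqs))}
    for i, a in enumerate(seqs):
        for j, b in enumerate(seqs):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    succ[i].append((j, k))
                    pred[j].append(i)

    used, merged = set(), []
    for i in range(len(seqs)):
        # Join i -> j only if i has exactly one successor and j has
        # exactly one predecessor; branching nodes are left untouched.
        if i in used or len(succ[i]) != 1:
            continue
        j, k = succ[i][0]
        if j in used or len(pred[j]) != 1:
            continue
        merged.append(seqs[i] + seqs[j][k:])
        used.update((i, j))

    rest = [s for idx, s in enumerate(seqs) if idx not in used]
    return merged + rest, bool(used)


def grow_contigs(reads, min_len=5):
    """Repeat merge rounds until no further pair can be joined."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    changed = True
    while changed:
        seqs, changed = merge_once(seqs, min_len)
    return seqs
```

Calling `grow_contigs` on reads that each span a distinguishing variant extends the contigs round by round while keeping the two alleles in separate contigs. Reads that lie entirely in sequence shared between haplotypes create branching nodes, which this sketch simply leaves unmerged.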

AVAILABILITY AND IMPLEMENTATION

POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
