Sankararaman Abishek, Vikalo Haris, Baccelli François
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA.
Department of Mathematics, The University of Texas at Austin, Austin, TX, USA.
BMC Genomics. 2020 Sep 9;21(Suppl 9):586. doi: 10.1186/s12864-020-06935-x.
Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data.
We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet - a novel assembly algorithm for diploid and ployploid haplotypes which allows both bialleleic and multi-allelic variants.
Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum-Tuberosum (Potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
单倍型是区分染色体序列与其同源对的单核苷酸变异的有序列表,它可能揭示个体对遗传性疾病和复杂疾病的易感性,并影响我们身体对治疗药物的反应。从短测序读段重建个体的单倍型是一个NP难问题,在多倍体情况下更具挑战性。虽然增加测序读段长度和插入片段大小有助于提高重建准确性,但也加剧了单倍型组装任务的计算复杂性。这促使人们寻求能够从高通量测序数据中准确且高效地组装单倍型的算法框架。
我们提出了一种新颖的测序读段图形表示方法,并将单倍型组装问题作为空间随机图上的社区检测实例提出。为此,我们构建了一个图,其中每个读段是一个节点,带有一个未知的社区标签,该标签将读段与其采样的单倍型相关联。然后,单倍型重建可以被视为一个两步过程:首先,恢复节点(即读段)上的社区标签,然后使用估计的标签组装单倍型。基于这一观察结果,我们提出了ComHapDet——一种用于二倍体和多倍体单倍型的新颖组装算法,它允许双等位基因和多等位基因变异。
所提出算法的性能在模拟数据以及通过对四倍体双等位基因马铃薯5号染色体进行测序获得的实验数据上进行了基准测试。结果证明了所提方法的有效性,并且与现有技术相比具有优势。