Pryszcz Leszek P, Gabaldón Toni
Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain; International Institute of Molecular and Cell Biology, Warsaw, Poland.
Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Pg. Lluís Companys 23, 08010 Barcelona, Spain.
Nucleic Acids Res. 2016 Jul 8;44(12):e113. doi: 10.1093/nar/gkw294. Epub 2016 Apr 29.
Many genomes display high levels of heterozygosity (i.e. the presence of different alleles at the same loci on homologous chromosomes), with those of hybrid organisms being an extreme case. The assembly of highly heterozygous genomes from short sequencing reads is challenging because it is difficult to accurately recover the different haplotypes. When confronted with highly heterozygous genomes, the standard assembly process tends to collapse homozygous regions and to report heterozygous regions as alternative contigs. The boundaries between homozygous and heterozygous regions result in multiple assembly paths that are hard to resolve, which leads to highly fragmented assemblies with a total size larger than expected. This, in turn, causes numerous problems in downstream analyses, such as fragmented gene models, wrong gene copy numbers, or broken synteny. To circumvent these problems, we have developed a pipeline that specifically deals with the assembly of heterozygous genomes by introducing a step to recognise and selectively remove alternative heterozygous contigs. We tested our pipeline on simulated and naturally occurring heterozygous genomes and compared its accuracy to that of other existing tools. Our method is freely available at https://github.com/Gabaldonlab/redundans.
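The idea of recognising and selectively removing alternative heterozygous contigs can be illustrated with a minimal Python sketch. This is not the Redundans implementation: the `Hit` record, the `reduce_contigs` function, and the thresholds `min_identity` and `min_overlap` are hypothetical names and values chosen for illustration, and the pairwise contig alignments are assumed to have been computed beforehand by some aligner. The sketch drops a contig when most of its length aligns at high identity to a longer contig, which is one plausible way to flag it as an alternative copy of the same region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment between two contigs (hypothetical record)."""
    query: str        # contig tested for redundancy
    target: str       # contig it aligns to
    identity: float   # fraction of identical bases in the alignment (0..1)
    aligned_len: int  # number of query bases covered by the alignment


def reduce_contigs(lengths, hits, min_identity=0.9, min_overlap=0.8):
    """Return the names of contigs kept after removing redundant ones.

    A contig is removed when it aligns to a longer contig with
    identity >= min_identity over >= min_overlap of its own length.
    Both thresholds are illustrative assumptions, not published defaults.
    """
    kept = set(lengths)
    for hit in hits:
        if hit.query == hit.target:
            continue  # ignore self-alignments
        # Only remove the smaller contig of a pair; ties are broken by name
        # so that two equal-length contigs never remove each other.
        if (lengths[hit.target], hit.target) <= (lengths[hit.query], hit.query):
            continue
        covered = hit.aligned_len / lengths[hit.query]
        if hit.identity >= min_identity and covered >= min_overlap:
            kept.discard(hit.query)
    return kept


# Toy example: c2 is a near-identical alternative copy of part of c1,
# whereas c3 aligns to c1 only at low identity and is therefore kept.
lengths = {"c1": 10000, "c2": 3000, "c3": 2500}
hits = [
    Hit("c2", "c1", 0.95, 2900),
    Hit("c3", "c1", 0.70, 2400),
    Hit("c1", "c2", 0.95, 2900),
]
kept = reduce_contigs(lengths, hits)
print(sorted(kept))  # → ['c1', 'c3']
```

Keeping the longer contig of each redundant pair is a design choice of this sketch: it favours contiguity of the reduced assembly, at the cost of discarding the alternative haplotype sequence.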