Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, CA, 92697-2525, USA.
BMC Bioinformatics. 2021 Jan 6;22(1):9. doi: 10.1186/s12859-020-03939-y.
Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding.
Here we present a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single-copy BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera), with a genome of 490 Mb, a mosquito (Anopheles funestus; 200 Mb) and the thorny skate (Amblyraja radiata; 2650 Mb).
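The search described above can be sketched in a few lines of Python. This is a minimal illustration, not HapSolo's implementation: the cost formula (missing plus duplicated genes divided by single-copy genes), the weights, the three-threshold parameterization and the `toy_evaluate` stand-in are all assumptions made for the example. In practice, evaluating a candidate would mean purging contigs at the given alignment thresholds and running BUSCO on the result.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoCounts:
    """BUSCO gene counts for one candidate assembly."""
    single: int
    duplicated: int
    missing: int


def cost(counts, w_missing=1.0, w_duplicated=1.0):
    """Hypothetical cost: penalize missing and duplicated genes,
    reward single-copy genes. Lower is better."""
    if counts.single == 0:
        return float("inf")
    penalty = w_missing * counts.missing + w_duplicated * counts.duplicated
    return penalty / counts.single


def hill_climb(evaluate, start, step=0.05, iterations=1000, seed=0):
    """Random-restart-free hill climbing over thresholds in [0, 1].

    `evaluate` maps a tuple of thresholds to BuscoCounts. A random
    neighbor is accepted only if it strictly lowers the cost.
    """
    rng = random.Random(seed)
    best = tuple(start)
    best_cost = cost(evaluate(best))
    for _ in range(iterations):
        candidate = tuple(
            min(1.0, max(0.0, x + rng.uniform(-step, step))) for x in best
        )
        candidate_cost = cost(evaluate(candidate))
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


def toy_evaluate(thresholds):
    """Assumed stand-in for 'purge contigs, then run BUSCO'.

    Lenient thresholds leave duplicated genes; strict ones lose genes.
    """
    t = sum(thresholds) / len(thresholds)
    duplicated = round(200 * t)
    missing = round(200 * (1 - t) ** 2)
    single = 1000 - duplicated - missing
    return BuscoCounts(single, duplicated, missing)


if __name__ == "__main__":
    start = (0.9, 0.9, 0.9)
    best, best_cost = hill_climb(toy_evaluate, start)
    print("start cost:", cost(toy_evaluate(start)))
    print("best thresholds:", best, "cost:", best_cost)
```

Because a neighbor is accepted only when it lowers the cost, the returned cost can never exceed that of the starting point. A user-defined cost function, as the abstract mentions, would simply replace `cost` here.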
HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified in downstream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~84%. For the mosquito, the fraction of the genome captured in the largest three scaffolds, corresponding to the number of chromosomes, increased from 61% to 86%, and the improvement was even more pronounced for the thorny skate. We compared the scaffolding results to assemblies that were based on PurgeDups for identifying secondary contigs, with generally superior results for HapSolo.
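The two metrics reported above, N50 and the fraction of the assembly captured in the largest k scaffolds, are simple to compute from a list of sequence lengths. The sketch below uses the standard N50 definition; the function names and the toy lengths are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least
    half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must contain at least one positive value")


def fraction_in_largest(lengths, k):
    """Fraction of total assembly size held by the k longest sequences."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("total assembly size is zero")
    return sum(ordered[:k]) / total


if __name__ == "__main__":
    # Toy lengths: three primary contigs plus unpurged secondary contigs.
    primary = [50, 30, 20]
    secondary = [20, 15, 15, 10, 10, 10, 10]
    unreduced = primary + secondary

    print(n50(unreduced), n50(primary))            # 20 50
    print(round(fraction_in_largest(unreduced, 3), 3))  # 0.526
    print(fraction_in_largest(primary, 3))         # 1.0
```

In this toy case, removing the secondary contigs shrinks the total size, which raises both N50 and the share of the assembly held by the three largest sequences. The same mechanism is behind the direction of the improvements reported above.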