1 Laboratory of Host Pathogen Interactions-UBM, Institut Pasteur de Montevideo, Montevideo, Uruguay.
2 Sección Biomatemática - Unidad de Genómica Evolutiva, Facultad de Ciencias-UDELAR, Montevideo, Uruguay.
Microb Genom. 2018 May;4(5). doi: 10.1099/mgen.0.000177. Epub 2018 Apr 30.
Although the genome of Trypanosoma cruzi, the causative agent of Chagas disease, was first made available in 2005, with additional strains reported later, the intrinsic complexity of this parasite's genome (an abundance of repetitive sequences and genes organized in tandem) has traditionally hindered high-quality genome assembly and annotation. This also limits the many types of analysis that require a high degree of precision. Long reads generated by third-generation sequencing technologies are particularly well suited to the challenges posed by the T. cruzi genome, since they permit direct determination of the full sequence of large clusters of repetitive sequences without collapsing them. This, in turn, not only allows accurate estimation of gene copy numbers but also circumvents assembly fragmentation. Here, we present the analysis of the genome sequences of two T. cruzi clones, the hybrid TCC (TcVI) and the non-hybrid Dm28c (TcI), determined by PacBio Single Molecule Real-Time (SMRT) technology. The improved assemblies obtained here permitted us to accurately estimate gene copy numbers and the abundance and distribution of repetitive sequences (including satellites and retroelements). We found that the genome of T. cruzi is composed of a 'core compartment' and a 'disruptive compartment', which exhibit opposite GC content and gene composition. Novel tandem and dispersed repetitive sequences were identified, including some located inside coding sequences. Additionally, homologous chromosomes were assembled separately, allowing us to retrieve haplotypes as separate contigs instead of a single mosaic sequence. Finally, manual annotation of surface multigene families, mucins and trans-sialidases now provides a better overview of these complex groups of genes.
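The contrast in GC content between the two compartments can be illustrated by computing GC fraction in fixed windows along a contig. The following is a minimal sketch only: the abstract does not say how GC content was measured, so the function names (`gc_fraction`, `windowed_gc`), the window size, and the toy sequence are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Optional, Tuple


def gc_fraction(seq: str) -> float:
    """Return the fraction of G/C among unambiguous bases (A, C, G, T).

    Ambiguous symbols such as N are ignored; an empty or fully
    ambiguous sequence yields 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(
    seq: str, window: int, step: Optional[int] = None
) -> List[Tuple[int, float]]:
    """Return (start, GC fraction) for each window along seq.

    Hypothetical helper: `window` and `step` are illustrative
    parameters, not values taken from the paper. Windows are
    non-overlapping when `step` is omitted. A sequence shorter than
    `window` is reported as a single window starting at 0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    step = step or window
    last_start = max(len(seq) - window + 1, 1)
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, last_start, step)
    ]


if __name__ == "__main__":
    # Toy contig: a GC-rich stretch followed by an AT-rich stretch.
    toy = "GGCCGGCC" + "AATTAATT"
    for start, gc in windowed_gc(toy, window=8):
        print(start, round(gc, 2))
```

On a real assembly, the same function could be applied to each contig read from a FASTA file, with a window of several kilobases, to see whether regions of differing GC content coincide with different gene families.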