Bronski Michael J, Martinez Ciera C, Weld Holli A, Eisen Michael B
Department of Molecular and Cell Biology, University of California, Berkeley
G3 (Bethesda). 2020 May 4;10(5):1443-1455. doi: 10.1534/g3.119.400959.
Large groups of species with well-defined phylogenies are excellent systems for testing evolutionary hypotheses. In this paper, we describe the creation of a comparative genomic resource consisting of 23 genomes from the species-rich Drosophila montium species group, 22 of which are presented here for the first time. The montium group is well-positioned for clade genomics. Within the clade, evolutionary distances are such that large numbers of sequences can be accurately aligned while also recovering strong signals of divergence; and the distance between the montium group and D. melanogaster is short enough that orthologous sequence can be readily identified. All genomes were assembled from a single, small-insert library using MaSuRCA, before going through an extensive post-assembly pipeline. Estimated genome sizes within the group range from 155 Mb to 223 Mb (mean = 196 Mb). The absence of long-distance information during the assembly process resulted in fragmented assemblies, with scaffold NG50s varying widely based on repeat content and sample heterozygosity (min = 18 kb, max = 390 kb, mean = 74 kb). The total scaffold length for most assemblies is also shorter than the estimated genome size, typically by 5-15%. However, subsequent analysis showed that our assemblies are highly complete. Despite large differences in contiguity, all assemblies contain at least 96% of known single-copy Dipteran genes (BUSCOs, n = 2,799). Similarly, by aligning our assemblies to the D. melanogaster genome and remapping coordinates for a large set of transcriptional enhancers (n = 3,457), we showed that each assembly contains orthologs for at least 91% of these enhancers. Importantly, the genic and enhancer contents of our assemblies are comparable to those of far more contiguous assemblies. The alignment of one of our own assemblies to a previously published PacBio assembly of the same species also showed that our longest scaffolds (up to 1 Mb) are free of large-scale misassemblies.
Our genome assemblies are a valuable resource that can be used to further resolve the group phylogeny; study the evolution of protein-coding genes and cis-regulatory sequences; and determine the genetic basis of ecological and behavioral adaptations.
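For readers unfamiliar with the contiguity statistic quoted above, scaffold NG50 is the length L such that scaffolds of length L or longer together span at least half of the estimated genome size (rather than half of the assembly length, as in N50). The following is a minimal sketch of that definition only; the function name and the toy scaffold lengths are illustrative assumptions and are not taken from the paper or its pipeline.

```python
def ng50(scaffold_lengths, genome_size):
    """Return the scaffold NG50, or None if the assembly is too short.

    NG50 is the length of the scaffold at which the running total of
    scaffold lengths, taken from longest to shortest, first reaches
    half of the *estimated genome size*.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None  # scaffolds cover less than half the estimated genome


# Hypothetical toy assembly (lengths in kb); not data from the paper.
toy_scaffolds = [390, 200, 120, 74, 50, 18, 18, 10]
toy_genome_size = 1000  # assumed estimated genome size, in kb

toy_ng50 = ng50(toy_scaffolds, toy_genome_size)

# Fraction by which total scaffold length falls short of the estimate.
toy_shortfall = 1 - sum(toy_scaffolds) / toy_genome_size

print(toy_ng50)                 # 200
print(round(toy_shortfall, 2))  # 0.12
```

In this toy case the two longest scaffolds (390 + 200 = 590 kb) are enough to pass half of the 1,000 kb estimate, so the NG50 is 200 kb, and the 880 kb of total scaffold length is 12% shorter than the estimated genome size, which falls inside the 5-15% range reported for most assemblies.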