Uralsky L I, Shepelev V A, Alexandrov A A, Yurov Y B, Rogaev E I, Alexandrov I A
Institute of Molecular Genetics, Russian Academy of Sciences, Kurchatov Sq. 2, Moscow 123182, Russia.
Department of Genomics and Human Genetics, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119991, Russia.
Data Brief. 2019 Mar 8;24:103708. doi: 10.1016/j.dib.2019.103708. eCollection 2019 Jun.
In the latest hg38 human genome assembly, centromeric gaps have been filled in by alpha satellite (AS) reference models (RMs), which are statistical representations of the homogeneous higher-order repeat (HOR) arrays that make up the bulk of the centromeric regions. We analyzed these models to compose an atlas of human AS HORs in which each monomer of a HOR is represented by a number of its polymorphic sequence variants. We combined these data with the HMMER sequence analysis platform to annotate AS HORs in the assembly. This led to the discovery of a new type of low-copy-number, highly divergent HORs that were not represented by RMs; these were included in the dataset. The annotation can be viewed as a UCSC Genome Browser custom track (the HOR-track) and used together with our previous annotation of AS suprachromosomal families (SFs) in the same assembly (the SF-track), in which each AS monomer can be viewed in its genomic context together with its classification into one of the 5 major SFs. To catalog the diversity of AS HORs in the human genome, we introduced a new naming system: each HOR received a name that shows its SF, chromosomal location and index number. Here we present the first installment of the HOR-track, covering only the 17 HORs that belong to SF1, which forms live functional centromeres in chromosomes 1, 3, 5, 6, 7, 10, 12, 16 and 19 and also a large number of minor dead HOR domains, both homogeneous and divergent. The monomer-by-monomer HOR annotation used for this dataset, as opposed to annotation of whole HOR repeats, allows mapping and quantification of various structural variants of AS HORs, which can be used to collect data on inter-individual polymorphism of AS.
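To illustrate why monomer-by-monomer annotation lends itself to quantifying structural variants, the following minimal sketch counts canonical and non-canonical HOR copies in an ordered run of annotated monomers. It is not the authors' pipeline. The function names, the representation of monomers as integer positions within a HOR, and the rule that a new copy starts whenever the position fails to increase are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable


def split_hor_copies(monomer_indices: Iterable[int]) -> list[tuple[int, ...]]:
    """Split an ordered run of monomer positions into putative HOR copies.

    Assumption (illustrative only): a new copy begins whenever the
    monomer position does not increase relative to the previous monomer.
    """
    copies: list[tuple[int, ...]] = []
    current: list[int] = []
    for idx in monomer_indices:
        if current and idx <= current[-1]:
            copies.append(tuple(current))
            current = []
        current.append(idx)
    if current:
        copies.append(tuple(current))
    return copies


def count_structural_variants(monomer_indices: Iterable[int], hor_length: int) -> dict:
    """Count canonical copies (1..hor_length) and every other monomer pattern."""
    canonical = tuple(range(1, hor_length + 1))
    counts = Counter(split_hor_copies(monomer_indices))
    n_canonical = counts.pop(canonical, 0)
    return {"canonical": n_canonical, "variants": dict(counts)}


# Hypothetical annotation of a 4-monomer HOR array: one copy lacks
# monomer 3 and the run ends with a truncated copy (monomers 3-4).
track = [1, 2, 3, 4, 1, 2, 4, 1, 2, 3, 4, 1, 2, 3, 4, 3, 4]
result = count_structural_variants(track, hor_length=4)
```

With whole-repeat annotation, the shortened copy `(1, 2, 4)` would have to be either forced into the canonical model or skipped. With per-monomer labels it can be tallied as a distinct variant.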