Altemose Nicolas, Miga Karen H, Maggioni Mauro, Willard Huntington F
Genome Biology Group, Duke Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, United States of America.
Department of Mathematics, Duke University, Durham, North Carolina, United States of America.
PLoS Comput Biol. 2014 May 15;10(5):e1003628. doi: 10.1371/journal.pcbi.1003628. eCollection 2014 May.
The largest gaps in the human genome assembly correspond to multi-megabase heterochromatic regions composed primarily of two related families of tandem repeats, Human Satellites 2 and 3 (HSat2,3). The abundance of repetitive DNA in these regions challenges standard mapping and assembly algorithms, and as a result, the sequence composition and potential biological functions of these regions remain largely unexplored. Furthermore, existing genomic tools designed to predict consensus-based descriptions of repeat families cannot be readily applied to complex satellite repeats such as HSat2,3, which lack a consistent repeat unit reference sequence. Here we present an alignment-free method to characterize complex satellites using whole-genome shotgun read datasets. Utilizing this approach, we classify HSat2,3 sequences into fourteen subfamilies and predict their chromosomal distributions, resulting in a comprehensive satellite reference database to further enable genomic studies of heterochromatic regions. We also identify 1.3 Mb of non-repetitive sequence interspersed with HSat2,3 across 17 unmapped assembly scaffolds, including eight annotated gene predictions. Finally, we apply our satellite reference database to high-throughput sequence data from 396 males to estimate array size variation of the predominant HSat3 array on the Y chromosome, confirming that satellite array sizes can vary between individuals over an order of magnitude (7 to 98 Mb) and further demonstrating that array sizes are distributed differently within distinct Y haplogroups. In summary, we present a novel framework for generating initial reference databases for unassembled genomic regions enriched with complex satellite DNA, and we further demonstrate the utility of these reference databases for studying patterns of sequence variation within human populations.