Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA.
Genome Res. 2018 May;28(5):714-725. doi: 10.1101/gr.231472.117. Epub 2018 Mar 27.
Eukaryotic genomes are replete with repeated sequences in the form of transposable elements (TEs) dispersed across the genome or as satellite arrays, large stretches of tandemly repeated sequences. Many satellites clearly originated as TEs, but it is unclear how mobile genetic parasites can transform into megabase-sized tandem arrays. Comprehensive population genomic sampling is needed to determine the frequency and generative mechanisms of tandem TEs, at all stages from their initial formation to their subsequent expansion and maintenance as satellites. The best available population resources, short-read DNA sequences, are often considered to be of limited utility for analyzing repetitive DNA due to the challenge of mapping individual repeats to unique genomic locations. Here we develop a new pipeline called ConTExt that demonstrates that paired-end Illumina data can be successfully leveraged to identify a wide range of structural variation within repetitive sequence, including tandem elements. By analyzing 85 genomes from five populations of Drosophila melanogaster, we discover that TEs commonly form tandem dimers. Our results further suggest that insertion site preference is the major mechanism by which dimers arise and that, consequently, dimers form rapidly during periods of active transposition. This abundance of TE dimers has the potential to provide source material for future expansion into satellite arrays, and we discover one such copy number expansion of the DNA transposon hobo to approximately 16 tandem copies in a single line. The very process that defines TEs, transposition, thus regularly generates sequences from which new satellites can arise.