Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A1, Canada.
Donnelly Centre, University of Toronto, Toronto, ON M5S 1A1, Canada.
Genetics. 2022 Jul 4;221(3). doi: 10.1093/genetics/iyac074.
Sequences derived from the Long INterspersed Element-1 (L1) family of retrotransposons occupy at least 17% of the human genome, with 67 distinct subfamilies representing successive waves of expansion and extinction in mammalian lineages. L1s contribute extensively to gene regulation, but their molecular history is difficult to trace because most are present only as truncated and highly mutated fossils. Consequently, L1 entries in current databases of repeat sequences are composed mainly of short diagnostic subsequences rather than full functional progenitor sequences for each subfamily. Here, we have coupled 2 levels of sequence reconstruction (at the level of whole genomes and of L1 subfamilies) to reconstruct progenitor sequences for all human L1 subfamilies that are more functionally and phylogenetically plausible than existing models. Most of the reconstructed sequences are at or near the canonical length of L1s and encode uninterrupted ORFs with the expected protein domains. We also show that the presence or absence of binding sites for KRAB-C2H2 zinc finger proteins, even in reconstructed ancient progenitor L1s, mirrors binding observed in human ChIP-exo experiments, thus extending the arms race and domestication model. RepeatMasker searches of the modern human genome suggest that the new models may be able to assign subfamily-level identities to previously ambiguous L1 instances. The reconstructed L1 sequences will be useful for genome annotation and for functional study of both L1 evolution and L1 contributions to host regulatory networks.