Cechova Monika, Harris Robert S, Tomaszkiewicz Marta, Arbeithuber Barbara, Chiaromonte Francesca, Makova Kateryna D
Department of Biology, Pennsylvania State University, University Park, PA.
Department of Statistics, Pennsylvania State University, University Park, PA.
Mol Biol Evol. 2019 Nov 1;36(11):2415-2431. doi: 10.1093/molbev/msz156.
Satellite repeats are a structural component of centromeres and telomeres, and in some instances, their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: 1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and 2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males versus females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.
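The repeat density used above to compare genomes can be illustrated with a minimal sketch. It is not the authors' pipeline or their computational tool. It assumes that density means bases covered by perfect tandem arrays of a motif per million sequenced bases, that a motif is treated as equivalent to its rotations and to its reverse complement, and that an array needs at least two copies. The function names `canonical_motif` and `repeat_density_per_mb` and the `min_copies` parameter are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_motif(motif: str) -> str:
    """Pick one representative for a motif, all its rotations, and both strands.

    Assumption: the representative is the lexicographically smallest string
    among the rotations of the motif and of its reverse complement.
    """
    motif = motif.upper()
    candidates = []
    for s in (motif, reverse_complement(motif)):
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)


def repeat_density_per_mb(reads, motif, min_copies=2):
    """Bases in perfect tandem arrays of `motif` per million sequenced bases.

    Arrays are searched on both strands. A per-read mask of covered positions
    keeps overlapping matches from being counted twice. Arrays that start in
    another phase (rotation) are still found, minus partial copies at the ends.
    """
    motif = motif.upper()
    patterns = [
        re.compile("(?:%s){%d,}" % (re.escape(m), min_copies))
        for m in {motif, reverse_complement(motif)}
    ]
    total_bases = 0
    repeat_bases = 0
    for read in reads:
        read = read.upper()
        covered = bytearray(len(read))  # 1 where a base lies in an array
        for pattern in patterns:
            for match in pattern.finditer(read):
                span = match.end() - match.start()
                covered[match.start():match.end()] = b"\x01" * span
        repeat_bases += sum(covered)
        total_bases += len(read)
    return 1e6 * repeat_bases / total_bases if total_bases else 0.0


if __name__ == "__main__":
    toy_reads = ["AATGGAATGGAATGGTTTTT", "CCATTCCATTACGTACGTAC"]
    print(canonical_motif("CCATT"))
    print(repeat_density_per_mb(toy_reads, "AATGG"))
```

Real short reads contain sequencing errors and imperfect repeat copies, so a perfect-match search like this one would understate densities. It is meant only to make the quantity concrete.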