Harvard School of Public Health, Boston, Massachusetts 02115, USA.
Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA.
Genome Res. 2023 Dec 1;33(11):1994-2001. doi: 10.1101/gr.278005.123.
Satellite DNA consists of long tandemly repeating sequences in a genome that may be organized as high-order repeats (HORs). These sequences are enriched in centromeres and are challenging to assemble. Existing algorithms for identifying satellite repeats either require the complete assembly of satellites or only work for simple repeat structures without HORs. Here we describe Satellite Repeat Finder (SRF), a new algorithm for reconstructing satellite repeat units and HORs from accurate reads or assemblies without prior knowledge of repeat structures. Applying SRF to real sequence data, we show that SRF can reconstruct known satellites in humans and well-studied model organisms. We also find that satellite repeats are pervasive in various other species, accounting for up to 12% of their genome contents, yet they are often underrepresented in assemblies. With the rapid progress in genome sequencing, SRF will help with the annotation of new genomes and the study of satellite DNA evolution, even when such repeats are not fully assembled.