Takeda Atsushi, Nonaka Daisuke, Imazu Yuta, Fukunaga Tsukasa, Hamada Michiaki
Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, 1698555, Japan.
Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo, 1698555, Japan.
Mob DNA. 2025 Apr 3;16(1):16. doi: 10.1186/s13100-025-00353-0.
Interspersed repeats occupy a large part of many eukaryotic genomes, and thus their accurate annotation is essential for various genome analyses. Database-free de novo repeat detection approaches are powerful for annotating genomes that lack well-curated repeat databases. However, the repeat detection performance of existing tools is not yet sufficient.
In this study, we developed REPrise, a de novo interspersed repeat detection software program based on a seed-and-extension method. Although the algorithm of REPrise is similar to that of RepeatScout, which is currently the de facto standard tool, we incorporated three unique techniques into REPrise: inexact seeding, affine gap scoring and loose masking. Analyses of rice and simulated genome datasets showed that REPrise outperformed RepeatScout in terms of sensitivity, especially when the repeat sequences contained many mutations. Furthermore, when applied to the complete human genome dataset T2T-CHM13, REPrise demonstrated the potential to detect novel repeat sequence families.
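The idea behind inexact seeding can be illustrated with a minimal sketch. This is not the REPrise implementation: the function names, the one-mismatch limit, and the brute-force neighbor enumeration are all assumptions made for illustration. It shows why tolerating mismatches in seeds may help with mutated repeats: a k-mer whose exact copies are rare can still gather a high count once near-identical k-mers are included.

```python
from collections import Counter

def hamming_neighbors(kmer, alphabet="ACGT"):
    """Yield the k-mer itself and every k-mer at Hamming distance 1 (hypothetical helper)."""
    yield kmer
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def inexact_seed_counts(genome, k):
    """Return (exact, inexact) k-mer counts for a single strand.

    The inexact count of a k-mer sums the exact counts of all k-mers
    within one mismatch of it. Illustrative assumption only; a real
    tool would use a far more efficient index.
    """
    exact = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    inexact = Counter()
    for kmer in exact:
        inexact[kmer] = sum(exact.get(n, 0) for n in hamming_neighbors(kmer))
    return exact, inexact

# Toy sequence: two exact copies of ACGT and one mutated copy, ACGA.
exact, inexact = inexact_seed_counts("ACGTACGAACGT", 4)
print(exact["ACGT"], inexact["ACGT"])
```

In this toy sequence the mutated copy is invisible to exact counting but contributes to the inexact count, so a frequency threshold on seeds would be reached sooner.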
REPrise can detect interspersed repeats with high sensitivity even in long genomes. Our software enhances repeat annotation in diverse genomic studies, contributing to a deeper understanding of genomic structures.