Rudenko Valentina, Korotkov Eugene
Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow 119071, Russia.
Int J Mol Sci. 2024 Apr 18;25(8):4441. doi: 10.3390/ijms25084441.
In this study, we applied the iterative procedure (IP) method to search for families of highly diverged dispersed repeats in a genome containing over 16 million bases. The algorithm constructs position weight matrices (PWMs) for repeat families and then uses dynamic programming to identify additional dispersed repeats based on those PWMs. The results showed that the genome contained 20 repeat families comprising a total of 33,938 dispersed repeats, significantly more than has previously been found using other methods. The repeats varied in length from 108 to 600 bp (522.54 bp on average) and occupied more than 72% of the genome, whereas previously identified repeats, including tandem repeats, have been shown to constitute only about 28%. The high genomic content of dispersed repeats and their location in coding regions suggest a significant role in regulating the functional activity of the genome.
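The two ingredients named in the abstract, a PWM built from a repeat family and a dynamic-programming scan of a sequence against that PWM, can be illustrated with a minimal sketch. This is not the authors' IP implementation: the helper names `build_pwm` and `best_local_hit`, the log-odds scoring against a uniform background, the pseudocount, and the linear gap penalty are all assumptions chosen for illustration, and the scan is a generic Smith-Waterman-style local alignment of PWM columns against the sequence.

```python
import math

ALPHABET = "ACGT"


def build_pwm(aligned, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sequences.

    Hypothetical helper: uniform background and additive pseudocount
    are illustrative assumptions, not values from the paper.
    """
    length = len(aligned[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for seq in aligned:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2(counts[base] / total / background) for base in ALPHABET}
        )
    return pwm


def best_local_hit(pwm, seq, gap=-2.0):
    """Local alignment of PWM columns against seq (Smith-Waterman style).

    Returns (best_score, end), where end is the exclusive end index
    of the best-scoring hit in seq. The gap penalty is an assumption.
    """
    n_cols = len(pwm)
    prev = [0.0] * (n_cols + 1)
    best, best_end = 0.0, 0
    for i, base in enumerate(seq, start=1):
        cur = [0.0] * (n_cols + 1)
        for j in range(1, n_cols + 1):
            match = prev[j - 1] + pwm[j - 1].get(base, 0.0)  # base vs. column j
            skip_col = cur[j - 1] + gap   # PWM column left unmatched
            extra_base = prev[j] + gap    # sequence base left unmatched
            cur[j] = max(0.0, match, skip_col, extra_base)
            if cur[j] > best:
                best, best_end = cur[j], i
        prev = cur
    return best, best_end


# Toy repeat family (made-up sequences) and a target containing one copy.
family = ["ACGTAC", "ACGTAC", "ACGAAC", "ACGTCC"]
pwm = build_pwm(family)
score, end = best_local_hit(pwm, "TTTT" + "ACGTAC" + "GGGG")
background_score, _ = best_local_hit(pwm, "TTTTTTTT")
```

In an iterative scheme of this kind, hits scoring above a chosen threshold would be added to the family and the PWM rebuilt; the threshold and the stopping rule are left out here because the abstract does not specify them.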