Institute of Biological Sciences, Federal University of Minas Gerais, Belo Horizonte, Brazil.
Simulation and Computational Biology Laboratory, High Performance Computing Center, Federal University of Pará, Belém, Brazil.
BMC Bioinformatics. 2024 Jun 18;25(1):217. doi: 10.1186/s12859-024-05842-2.
Tandem repeats (TRs) are specific sequences in genomic DNA that are repeated in tandem and are present in all organisms. Among the subcategories of TRs are satellite repeats, which are divided into macrosatellites, minisatellites, and microsatellites. The last two are of particular interest because their instability allows them to reveal polymorphisms between organisms. Currently, most mining tools focus on Simple Sequence Repeat (SSR) mining, and only a few can identify SSRs in coding regions.
We developed a microsatellite mining software called SATIN (Micro and Mini SATellite IdentificatioN tool), based on a new sliding window algorithm and written in C and Python. It represents a new approach to SSR mining that addresses the limitations of existing tools, particularly in coding-region SSR mining. SATIN is available at https://github.com/labgm/SATIN.git . It was shown to be the second-fastest tool for perfect and compound SSR mining. It can identify SSRs in coding regions as well as SSRs with motif sizes larger than 6. Beyond SSR mining, SATIN can also analyze SSR polymorphism in coding regions across predetermined groups and identify SSRs that are differentially abundant among those groups on a per-gene basis. For validation, we analyzed SSRs from two groups of Escherichia coli (K12 and O157) and compared the results with 5 known coding-region SSRs. SATIN identified all 5 of these SSRs among the 237 genes that contained at least one SSR.
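The per-gene, per-group comparison described above can be illustrated with a minimal Python sketch. It is not SATIN's implementation: the function names `ssrs_per_gene` and `mean_count_by_group`, the interval representation, and the rule that an SSR must lie entirely inside a coding interval are all assumptions made for illustration, and the gene and genome labels in the usage note are invented toy values.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # assumed convention: 0-based start, end-exclusive


def ssrs_per_gene(cds: Dict[str, Interval], ssrs: List[Interval]) -> Dict[str, int]:
    """Count the SSRs that lie entirely inside each gene's coding interval."""
    counts = {gene: 0 for gene in cds}
    for s_start, s_end in ssrs:
        for gene, (g_start, g_end) in cds.items():
            if g_start <= s_start and s_end <= g_end:
                counts[gene] += 1
    return counts


def mean_count_by_group(per_genome: Dict[str, Dict[str, int]],
                        groups: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Average the per-gene SSR counts over the genomes of each group."""
    totals = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for genome, counts in per_genome.items():
        group = groups[genome]
        sizes[group] += 1
        for gene, count in counts.items():
            totals[gene][group] += count
    # Genes with no SSR in a group get a mean of 0.0 for that group.
    return {gene: {g: totals[gene][g] / sizes[g] for g in sizes}
            for gene in totals}
```

With toy input such as `ssrs_per_gene({"geneA": (0, 100), "geneB": (200, 300)}, [(10, 22), (250, 262), (95, 107)])`, the third SSR is not counted because it extends past the end of `geneA`. Genes whose group means differ markedly would be candidates for a proper statistical test, which this sketch does not attempt.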
SATIN is a novel microsatellite search software that uses an innovative sliding window technique, based on a numerical list, to search for repeat regions. It identifies perfect and compound SSRs while generating comprehensible and analyzable outputs. The tool accepts files in FASTA or GenBank format as input for microsatellite mining and, for GenBank files, can also identify SSRs located in coding regions. In conclusion, we expect SATIN to help identify potential SSRs for use as genetic markers.
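To make the idea of a sliding window over a numerical list concrete, the following Python sketch encodes a sequence as integers and scans it for perfect SSRs. It is a generic illustration under stated assumptions, not SATIN's actual algorithm: the encoding table `CODE`, the function name `find_perfect_ssrs`, and the default thresholds are hypothetical, and compound SSRs are not handled.

```python
from typing import List, Tuple

# Hypothetical encoding; the numerical list used by SATIN may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def find_perfect_ssrs(seq: str, min_motif: int = 2, max_motif: int = 6,
                      min_repeats: int = 3) -> List[Tuple[int, int, str, int]]:
    """Return (start, end, motif, repeats) for perfect tandem repeats.

    Coordinates are 0-based with an exclusive end. Bases outside
    A/C/G/T are encoded as -1 and never start or extend a repeat.
    """
    seq = seq.upper()
    nums = [CODE.get(base, -1) for base in seq]
    n = len(nums)
    hits = []
    i = 0
    while i < n:
        best = None
        for k in range(min_motif, max_motif + 1):
            # Larger windows would also overrun or contain the bad base.
            if i + k > n or -1 in nums[i:i + k]:
                break
            # Slide forward while each position matches the one k behind it.
            j = i + k
            while j < n and nums[j] == nums[j - k]:
                j += 1
            repeats = (j - i) // k
            end = i + repeats * k
            # Keep the longest span; ties favour the shorter motif.
            if repeats >= min_repeats and (best is None or end > best[1]):
                best = (i, end, seq[i:i + k], repeats)
        if best is not None:
            hits.append(best)
            i = best[1]  # resume the scan after the reported repeat
        else:
            i += 1
    return hits
```

For example, `find_perfect_ssrs("GGCATATATATCCGAGAGAGTT")` reports an `AT` repeat with four copies and a `GA` repeat with three copies. Comparing integers at a fixed offset `k` avoids repeated substring construction, which is one plausible reason to work on a numerical list instead of the raw string.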