Mendel Centre for Plant Genomics and Proteomics, CEITEC Masaryk University, Brno CZ-62500, Czech Republic.
Laboratory of Functional Genomics and Proteomics, NCBR, Faculty of Science, Masaryk University, Brno CZ-61137, Czech Republic.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad080. Epub 2023 Oct 17.
While web-based tools such as BLAST have made identifying conserved gene homologs appear easy, genes with variable sequences pose significant challenges. Functionally important noncoding RNAs (ncRNAs) often show low sequence conservation due to genetic variation, including insertions and deletions. Instead of conserved sequences, these RNAs possess structural features that are highly conserved across a broad phylogenetic range. Such features can be identified using the covariance model approach, which combines a sequence alignment with a consensus RNA secondary structure. However, running the standard implementation of that approach (Infernal) requires advanced bioinformatics knowledge, unlike user-friendly web services such as BLAST. The issue is partially addressed by RNAcentral, which can be used to search for homologs across a broad range of ncRNA sequence collections from diverse organisms, but not across genome assemblies.
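To illustrate what working with covariance model search results involves, the following is a minimal sketch that reads the tabular hit summary written by Infernal's `cmsearch --tblout`. It assumes the default tabular column order (target name, accession, query name, accession, mdl, mdl from, mdl to, seq from, seq to, strand, trunc, pass, gc, bias, score, E-value, inc, description). The function name `parse_tblout` and the sample line are hypothetical and are not taken from GERONIMO.

```python
from typing import Dict, Iterable, Iterator


def parse_tblout(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per hit from cmsearch --tblout text.

    Assumes Infernal's default tabular layout (18 whitespace-separated
    columns, the last one being a free-text description).
    """
    for line in lines:
        # Skip blank lines and the '#'-prefixed header/footer comments.
        if not line.strip() or line.startswith("#"):
            continue
        # Split at most 17 times so the description keeps its spaces.
        fields = line.split(None, 17)
        if len(fields) < 17:
            continue  # malformed or truncated row
        yield {
            "target": fields[0],           # contig / sequence name
            "query": fields[2],            # covariance model name
            "seq_from": int(fields[7]),    # 1-based hit start on the target
            "seq_to": int(fields[8]),      # 1-based hit end on the target
            "strand": fields[9],           # '+' or '-'
            "score": float(fields[14]),    # bit score
            "evalue": float(fields[15]),
            "significant": fields[16] == "!",  # '!' marks included hits
        }


# Hypothetical example row, for illustration only.
sample = [
    "#target name accession query name accession mdl ...",
    "chr1 - U6 RF00026 cm 1 104 5000 5103 + no 1 0.45 0.0 98.7 1.2e-20 ! example contig",
]
hits = list(parse_tblout(sample))
```

Each parsed hit can then be filtered by the `significant` flag or by E-value before any downstream summarization.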
Here, we present GERONIMO, which conducts evolutionary searches across hundreds of genomes in a fully automated way. It places the results in taxonomic context and presents them as summary tables and visualizations to facilitate analysis. Additionally, GERONIMO supplements homologous sequences with flanking genomic regions, so that promoter motifs or gene collinearity can be analyzed to help validate the results.
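The idea of supplementing a hit with its surrounding genomic region can be sketched as follows. This is an illustrative example only, not GERONIMO's implementation: the function names, the default flank size, and the toy contig are assumptions. It assumes 1-based inclusive hit coordinates, with minus-strand hits reported as start greater than end (as Infernal does), and returns the region in the orientation of the hit.

```python
# Translation table for complementing DNA (N is left as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_with_flanks(contig: str, seq_from: int, seq_to: int,
                        strand: str, flank: int = 200) -> str:
    """Return the hit plus `flank` bases on each side.

    Coordinates are 1-based and inclusive. The window is clipped at the
    contig ends, and minus-strand hits are reverse-complemented so that
    upstream sequence always comes first.
    """
    start = min(seq_from, seq_to)
    end = max(seq_from, seq_to)
    left = max(start - 1 - flank, 0)       # convert to 0-based, clip at 0
    right = min(end + flank, len(contig))  # clip at contig end
    region = contig[left:right]
    return reverse_complement(region) if strand == "-" else region


# Toy contig, for illustration only.
contig = "AAAACCCCGGGGTTTT"
plus_region = extract_with_flanks(contig, 5, 8, "+", flank=2)
minus_region = extract_with_flanks(contig, 8, 5, "-", flank=2)
clipped_region = extract_with_flanks(contig, 1, 4, "+", flank=10)
```

Returning the region in hit orientation keeps the upstream flank, where promoter motifs would be expected, at the start of the extracted sequence regardless of strand.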
GERONIMO, built using Snakemake, has undergone extensive testing on hundreds of genomes, establishing itself as a valuable tool for identifying ncRNA homologs across diverse taxonomic groups. Consequently, GERONIMO facilitates the investigation of the evolutionary patterns of functionally significant ncRNAs, the understanding of which has previously been limited to individual organisms and their close relatives.