Departamento de Parasitologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil.
Department of Biology, York Biomedical Research Institute, University of York, York, UK.
BMC Bioinformatics. 2024 Jun 6;25(1):207. doi: 10.1186/s12859-024-05826-2.
Gene families are groups of homologous genes that often have similar biological functions. These families are formed by gene duplication events throughout evolution, resulting in multiple copies of an ancestral gene. Over time, these copies can acquire mutations and structural variations, resulting in members that may vary in size, motif ordering and sequence. Multigene families have been described in a broad range of organisms, from single-celled bacteria to complex multicellular organisms, and have been linked to an array of phenomena, such as host-pathogen interactions, immune evasion and embryonic development. Despite the importance of gene families, few approaches have been developed for estimating and graphically visualizing their diversity patterns and expression profiles in genome-wide studies.
Here, we introduce an R package named dgfr, which estimates and enables the visualization of sequence divergence within gene families, as well as the visualization of secondary data such as gene expression. The package takes as input a multi-fasta file containing the coding sequences (CDS) or amino acid sequences from a multigene family, performs pairwise alignments among all sequences, and estimates the distance between each pair. These distances are then subjected to dimension reduction, optimal cluster determination, and assignment of each gene to a cluster. The result is a dataset that allows for the visualization of sequence divergence and expression within the gene family, as well as an approximation of the number of clusters present in the family.
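The workflow described above (pairwise distances, dimension reduction, cluster-number selection, gene assignment) can be illustrated with a minimal Python sketch. This is not dgfr's code or API: the function names are hypothetical, a normalized edit distance stands in for the alignment-based distance, and classical multidimensional scaling, k-means and the mean silhouette width are assumed here as plausible choices for the reduction and clustering steps, which the abstract does not specify.

```python
import math
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance: a global alignment score with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of edit distances normalized by the longer sequence."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j]) / max(len(seqs[i]), len(seqs[j]))
    return d

def classical_mds(d, dims=2, iters=200):
    """Classical MDS: double-center squared distances, then power iteration."""
    n = len(d)
    sq = [[x * x for x in row] for row in d]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand) for j in range(n)]
         for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        v = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))  # negative eigenvalues give a zero axis
        for i in range(n):
            coords[i].append(v[i] * scale)
        # Deflate so the next pass finds the next component.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

def kmeans(points, k, iters=50):
    """k-means with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = set(labels)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        def mean_dist(c):
            ds = [math.dist(p, q) for j, q in enumerate(points) if j != i and labels[j] == c]
            return sum(ds) / len(ds) if ds else None
        a = mean_dist(labels[i])
        if a is None:
            continue
        b = min(mean_dist(c) for c in clusters if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def cluster_family(seqs, max_k=4):
    """Return 2-D coordinates and cluster labels, choosing k by silhouette."""
    points = classical_mds(distance_matrix(seqs))
    best_k = max(range(2, max_k + 1), key=lambda k: silhouette(points, kmeans(points, k)))
    return points, kmeans(points, best_k)

if __name__ == "__main__":
    # Toy sequences invented for illustration, not real gene family members.
    family = ["ATGGCCAAATTT", "ATGGCCAAATTC", "ATGGCCAAGTTT",
              "TTTCCGGGATAC", "TTTCCGGGATAA", "TTTCCGGCATAC"]
    coords, labels = cluster_family(family)
    for seq, xy, lab in zip(family, coords, labels):
        print(seq, [round(x, 3) for x in xy], lab)
```

The returned coordinates could be plotted as a scatter plot coloured by cluster or by a secondary variable such as expression level, which is the kind of visualization the abstract describes. For real gene families, a proper alignment-based distance and a numerical library would replace the pure-Python routines used here.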
dgfr provides a way to estimate and study the diversity of gene families, as well as visualize the dispersion and secondary profile of the sequences. The dgfr package is available at https://github.com/lailaviana/dgfr under the GPL-3 license.