Department of Civil and Environmental Engineering, Northwestern University, Evanston, IL 60208, USA.
Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA.
Bioinformatics. 2022 Jan 12;38(3):612-620. doi: 10.1093/bioinformatics/btab752.
Identifying variant forms of gene clusters of interest in phylogenetically proximate and distant taxa can help to infer their evolutionary histories and functions. Conserved gene clusters may differ by only a few genes, but these small differences can in turn produce substantial phenotypic changes, for example through the formation of pseudogenes or insertions that interrupt regulation. As microbial genomes and metagenomic assemblies become increasingly abundant, unsupervised grouping of similar, but not necessarily identical, gene clusters into consistent bins can provide a population-level understanding of their gene content variation and functional homology.
We developed GeneGrouper, a command-line tool that uses a density-based clustering method to group gene clusters into bins. GeneGrouper demonstrated high recall and precision in benchmarks for the detection of the 23-gene Salmonella enterica LT2 Pdu gene cluster and the four-gene Pseudomonas aeruginosa PAO1 Mex gene cluster among 435 genomes spanning mixed taxa. In a subsequent application investigating the diversity and impact of gene-complete and gene-incomplete LT2 Pdu gene clusters in 1130 S. enterica genomes, GeneGrouper identified a novel, frequently occurring pduN pseudogene. When investigated in vivo, introduction of the pduN pseudogene negatively impacted microcompartment formation. We next demonstrated the versatility of GeneGrouper by clustering distant homologous gene clusters and variable gene clusters found in integrative and conjugative elements.
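To illustrate the general idea of density-based binning of gene clusters, the following minimal sketch groups toy genomic regions by the overlap of their gene content. It is not GeneGrouper's implementation: the choice of DBSCAN, the Jaccard distance, the `eps` and `min_pts` values, the function names and the toy gene sets are all assumptions made for illustration only.

```python
from typing import FrozenSet, List, Optional


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Return 1 - |a & b| / |a | b| for two sets of gene labels."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(items: List[FrozenSet[str]], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN over gene-content sets; returns one bin label per item (-1 = noise)."""
    n = len(items)
    # Neighborhood of each item, including the item itself.
    neighbors = [
        [j for j in range(n) if jaccard_distance(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n  # None means "not yet visited"
    current_bin = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        current_bin += 1
        labels[i] = current_bin
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = current_bin  # noise reachable from a core point joins the bin
            if labels[j] is not None:
                continue
            labels[j] = current_bin
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point, so keep expanding
    return [label if label is not None else -1 for label in labels]


# Toy regions (hypothetical gene content, not data from the study).
regions = [
    frozenset({"pduA", "pduB", "pduC", "pduD", "pduE", "pduN"}),
    frozenset({"pduA", "pduB", "pduC", "pduD", "pduE"}),          # lacks pduN
    frozenset({"pduA", "pduB", "pduC", "pduD", "pduE", "pduN", "pduX"}),
    frozenset({"mexR", "mexA", "mexB", "oprM"}),
    frozenset({"mexA", "mexB", "oprM"}),                          # lacks mexR
    frozenset({"mexR", "mexA", "mexB", "oprM"}),
    frozenset({"xyzA", "xyzB"}),                                  # unrelated region
]

labels = dbscan(regions, eps=0.4, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this sketch, regions that differ by a single gene still fall into the same bin, while the unrelated region is left unbinned, which mirrors the goal of grouping similar but not necessarily identical gene clusters.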
GeneGrouper software and code are publicly available at https://pypi.org/project/GeneGrouper/.
Supplementary data are available at Bioinformatics online.