Johnson Sean R, Weigele Peter R, Fomenkov Alexey, Ge Andrew, Vincze Anna, Eaglesham James B, Roberts Richard J, Sun Zhiyi
New England Biolabs Inc., Ipswich, MA 01938, USA.
Nucleic Acids Res. 2025 Jan 11;53(2). doi: 10.1093/nar/gkae1175.
The availability of large databases of biological sequences presents an opportunity for in-depth exploration of gene diversity and function. Bacterial defense systems are a rich source of diverse but difficult-to-annotate genes with biotechnological applications. In this work, we present Domainator, a flexible and modular software suite for domain-based gene neighborhood and protein search, extraction and clustering. We demonstrate the utility of Domainator through three examples related to bacterial defense systems. First, we cluster CRISPR-associated Rossmann fold (CARF)-containing proteins with difficult-to-annotate effector domains, classifying most of them as likely transcriptional regulators and a subset as likely RNases. Second, we extract and cluster P4-like phage satellite defense hotspots, identify an abundant variant of Lamassu defense systems and demonstrate its in vivo activity against several T-even phages. Third, we integrate a protein language model into Domainator and use it to identify restriction endonucleases with low similarity to known reference sequences, validating the activity of one example in vitro. Domainator is available as an open-source package with detailed documentation and usage examples.