Myers Brendon K, Lamichhane Anuj, Kvitko Brian H, Dutta Bhabesh
Department of Plant Pathology, The University of Georgia, Tifton, Georgia, USA.
Department of Plant Pathology, The University of Georgia, Athens, Georgia, USA.
mSphere. 2025 Jul 29;10(7):e0002325. doi: 10.1128/msphere.00023-25. Epub 2025 Jun 17.
Allicin tolerance (alt) clusters in phytopathogenic bacteria, which provide resistance to thiosulfinates like allicin, are challenging to find using conventional approaches due to their varied architecture and the paradox of being vertically maintained within genera despite likely being horizontally transferred. This results in significant sequence diversity that further complicates their identification. Natural language processing (NLP)-like techniques, such as those used in DeepBGC, offer a promising solution by treating gene clusters like a language, allowing gene clusters to be identified and collected based on patterns and relationships within the sequences. We curated and validated alt-like clusters in Pantoea ananatis 97-1R, Burkholderia gladioli pv. gladioli FDAARGOS 389, and Pseudomonas syringae pv. tomato DC3000. Leveraging sequences from the RefSeq bacterial database, we conducted comparative analyses of gene synteny, gene/protein sequences, protein structures, and predicted protein interactions. This approach enabled the discovery of several novel alt-like clusters previously undetectable by other methods, which were further validated experimentally. Our work highlights the effectiveness of NLP-like techniques for identifying underrepresented gene clusters and expands our understanding of the diversity and utility of alt-like clusters in diverse bacterial genera. This work demonstrates the potential of these techniques to simplify the identification process and enhance the applicability of biological data in real-world scenarios.
IMPORTANCE
Thiosulfinates, like allicin, are potent antifeedants and antimicrobials produced by Allium species and pose a challenge for phytopathogenic bacteria. Phytopathogenic bacteria have been shown to utilize an allicin tolerance (alt) gene cluster to circumvent this host response, leading to economically significant yield losses. Due to the complexity of mining these clusters, we applied techniques akin to natural language processing to analyze Pfam domains and gene proximity. This approach led to the identification of novel alt-like gene clusters, showcasing the potential of artificial intelligence to reveal elusive and underrepresented genetic clusters and enhance our understanding of their diversity and role across various bacterial genera.
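As a rough illustration of what "analyzing Pfam domains and gene proximity" can look like, the sketch below treats each gene as a set of domain "words" and slides a fixed-size window along the genome, scoring each window by Jaccard similarity to a reference cluster's domain set. This is a deliberately simple stand-in and not the DeepBGC model or the pipeline used in this study. The domain labels (`PF_A`, `PF_B`, ...), the function names, the window size, and the threshold are all hypothetical placeholders chosen for the example.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical domain labels standing in for a reference cluster's Pfam
# content; these are placeholders, not real Pfam accessions.
REFERENCE_DOMAINS: Set[str] = {"PF_A", "PF_B", "PF_C", "PF_D"}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two domain sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def scan_windows(
    genes: Sequence[Set[str]],
    reference: Set[str],
    window: int = 4,
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """Slide a window of neighbouring genes along the genome.

    `genes` is an ordered list of per-gene domain sets, so adjacency in the
    list stands in for gene proximity. Returns (start, end, score) for each
    window whose pooled domain content is similar enough to `reference`.
    """
    hits: List[Tuple[int, int, float]] = []
    last_start = max(len(genes) - window + 1, 1)
    for start in range(last_start):
        end = min(start + window, len(genes))
        pooled: Set[str] = set().union(*genes[start:end])
        score = jaccard(pooled, reference)
        if score >= threshold:
            hits.append((start, end, score))
    return hits


# Toy genome: one set of domain labels per gene, in chromosomal order.
toy_genome = [
    {"PF_X"}, {"PF_A"}, {"PF_B", "PF_Y"}, {"PF_C"}, {"PF_D"}, {"PF_Z"}, {"PF_Q"},
]
candidates = scan_windows(toy_genome, REFERENCE_DOMAINS, window=4, threshold=0.6)
print(candidates)
```

Because the score depends only on which domains occur near each other, not on their order or on sequence identity, a window can still be flagged when gene order is rearranged or the underlying sequences have diverged. That is the property that makes domain-level representations attractive for clusters with varied architecture; learned models such as those in DeepBGC go further by embedding domains and modelling their context.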