Semantic search using protein large language models detects class II microcins in bacterial genomes.

Author information

Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, USA.

Department of Molecular Biosciences, The University of Texas at Austin, Austin, Texas, USA.

Publication information

mSystems. 2024 Oct 22;9(10):e0104424. doi: 10.1128/msystems.01044-24. Epub 2024 Sep 18.

Abstract

Class II microcins are antimicrobial peptides that have shown some potential as novel antibiotics. However, to date, only 10 class II microcins have been described, and the discovery of novel microcins has been hampered by their short length and high sequence divergence. Here, we ask if we can use numerical embeddings generated by protein large language models to detect microcins in bacterial genome assemblies and whether this method can outperform sequence-based methods such as BLAST. We find that embeddings detect known class II microcins much more reliably than does BLAST and that any two microcins tend to have a small distance in embedding space even though they typically are highly diverged at the sequence level. In data sets of Escherichia coli, Klebsiella spp., and Enterobacter spp. genomes, we further find novel putative microcins that were previously missed by sequence-based search methods.

IMPORTANCE

Antibiotic resistance is becoming an increasingly serious problem in modern medicine, but the development pipeline for conventional antibiotics is not promising. Therefore, alternative approaches to combat bacterial infections are urgently needed. One such approach may be to employ naturally occurring antibacterial peptides produced by bacteria to kill competing bacteria. A promising class of such peptides are class II microcins. However, only a small number of class II microcins have been discovered to date, and the discovery of further such microcins has been hampered by their high sequence divergence and short length, which can cause sequence-based search methods to fail. Here, we demonstrate that a more robust method for microcin discovery can be built on the basis of a protein large language model, and we use this method to identify several putative novel class II microcins.
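
The idea of searching by distance in embedding space can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the language model, the distance metric, or any cutoff, so the Euclidean distance, the three-dimensional toy vectors, and all names (`rank_candidates`, `known_microcin_A`, `orf_1`, and so on) are assumptions made purely for illustration. In practice each vector would be an embedding produced by a protein language model for a known microcin or for a candidate open reading frame.

```python
import math

def rank_candidates(known, candidates):
    """Rank candidate embeddings by distance to their nearest known embedding.

    known, candidates: dicts mapping a (hypothetical) sequence name to a
    list of floats of equal length. Returns a list of
    (candidate_name, nearest_known_name, distance), closest first.
    """
    scored = []
    for cand_name, cand_vec in candidates.items():
        # Find the known embedding closest to this candidate.
        # Euclidean distance is an assumption, not the paper's stated metric.
        nearest_name, nearest_dist = min(
            ((name, math.dist(cand_vec, vec)) for name, vec in known.items()),
            key=lambda pair: pair[1],
        )
        scored.append((cand_name, nearest_name, nearest_dist))
    # Smallest distance first: the most microcin-like candidates lead the list.
    return sorted(scored, key=lambda row: row[2])

# Toy 3-dimensional vectors standing in for real model embeddings.
known = {
    "known_microcin_A": [1.0, 0.0, 0.0],
    "known_microcin_B": [0.0, 1.0, 0.0],
}
candidates = {
    "orf_1": [0.9, 0.1, 0.0],
    "orf_2": [0.1, 0.2, 1.0],
    "orf_3": [0.2, 0.7, 0.1],
}

ranked = rank_candidates(known, candidates)
for cand_name, nearest_name, dist in ranked:
    print(f"{cand_name}\tnearest={nearest_name}\tdistance={dist:.3f}")
```

A ranking of this kind only nominates candidates; any hit would still be putative until examined further, in line with the abstract's wording.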

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2bc7/11494933/f8b18adc3263/msystems.01044-24.f001.jpg
