Semantic search using protein large language models detects class II microcins in bacterial genomes.

Author information

Kulikova Anastasiya V, Parker Jennifer K, Davies Bryan W, Wilke Claus O

Affiliations

Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA.

Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA.

Publication information

bioRxiv. 2023 Nov 15:2023.11.15.567263. doi: 10.1101/2023.11.15.567263.

Abstract

Class II microcins are antimicrobial peptides that have shown some potential as novel antibiotics. However, to date only ten class II microcins have been described, and discovery of novel microcins has been hampered by their short length and high sequence divergence. Here, we ask if we can use numerical embeddings generated by protein large language models to detect microcins in bacterial genome assemblies and whether this method can outperform sequence-based methods such as BLAST. We find that embeddings detect known class II microcins much more reliably than does BLAST and that any two microcins tend to have a small distance in embedding space even though they typically are highly diverged at the sequence level. In datasets of Escherichia coli, Klebsiella spp., and Enterobacter spp. genomes, we further find novel putative microcins that were previously missed by sequence-based search methods.
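The idea of searching by distance in embedding space can be illustrated with a minimal sketch. The abstract does not state which protein language model or distance metric was used, so the example below assumes that embeddings have already been computed elsewhere and uses Euclidean distance purely for illustration. The function names, identifiers, and three-dimensional toy vectors are all hypothetical; real embeddings would have hundreds or thousands of dimensions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(known, candidates):
    """Rank candidates by distance to their nearest known embedding.

    Returns a list of (candidate_id, nearest_known_id, distance),
    sorted so that the closest candidate comes first.
    """
    hits = []
    for cand_id, cand_vec in candidates.items():
        # Find the known sequence whose embedding is closest to this candidate.
        nearest_id, nearest_dist = min(
            ((known_id, euclidean(cand_vec, known_vec))
             for known_id, known_vec in known.items()),
            key=lambda pair: pair[1],
        )
        hits.append((cand_id, nearest_id, nearest_dist))
    return sorted(hits, key=lambda hit: hit[2])


# Toy vectors standing in for real embeddings (made-up numbers, assumption).
known_microcins = {
    "known_A": [0.9, 0.1, 0.0],
    "known_B": [0.8, 0.2, 0.1],
}
candidate_orfs = {
    "orf_1": [0.88, 0.12, 0.02],
    "orf_2": [0.0, 0.9, 0.4],
    "orf_3": [0.7, 0.3, 0.1],
}

ranked = rank_candidates(known_microcins, candidate_orfs)
for cand_id, nearest_id, dist in ranked:
    print(f"{cand_id}\tnearest={nearest_id}\tdistance={dist:.3f}")
```

In a sketch like this, candidates near the top of the ranking would be the ones to inspect further. Any distance cutoff would have to be calibrated on known sequences, and the abstract gives no value for one.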

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a4e5/10680697/a57234441280/nihpp-2023.11.15.567263v1-f0001.jpg
