Kulikova Anastasiya V, Parker Jennifer K, Davies Bryan W, Wilke Claus O
Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA.
Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA.
bioRxiv. 2023 Nov 15:2023.11.15.567263. doi: 10.1101/2023.11.15.567263.
Class II microcins are antimicrobial peptides that have shown some potential as novel antibiotics. However, to date only ten class II microcins have been described, and discovery of novel microcins has been hampered by their short length and high sequence divergence. Here, we ask if we can use numerical embeddings generated by protein large language models to detect microcins in bacterial genome assemblies and whether this method can outperform sequence-based methods such as BLAST. We find that embeddings detect known class II microcins much more reliably than does BLAST and that any two microcins tend to have a small distance in embedding space even though they typically are highly diverged at the sequence level. In datasets of Escherichia coli, Klebsiella spp., and Enterobacter spp. genomes, we further find novel putative microcins that were previously missed by sequence-based search methods.