National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, NY 10016, USA.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae017.
BACKGROUND: Phage therapy, reemerging as a promising approach to counter antimicrobial-resistant infections, relies on a comprehensive understanding of the specificity of individual phages. Yet the significant diversity within phage populations presents a considerable challenge. Currently, there is a notable lack of tools designed for large-scale characterization of phage receptor-binding proteins, which are crucial in determining the phage host range.
RESULTS: In this study, we present SpikeHunter, a deep learning method based on the ESM-2 protein language model. With SpikeHunter, we identified 231,965 diverse phage-encoded tailspike proteins, a crucial determinant of phage specificity that targets bacterial polysaccharide receptors, across 787,566 bacterial genomes from 5 virulent, antibiotic-resistant pathogens. Notably, 86.60% (143,200) of these proteins exhibited strong associations with specific bacterial polysaccharides. We discovered that phages with identical tailspike proteins can infect different bacterial species with similar polysaccharide receptors, underscoring the pivotal role of tailspike proteins in determining host range. The specificity is mainly attributed to the protein's C-terminal domain, which strictly correlates with host specificity during domain swapping in tailspike proteins. Importantly, our dataset-driven predictions of phage-host specificity closely match the phage-host pairs observed in real-world phage therapy cases we studied.
CONCLUSIONS: Our research provides a rich resource, including both the method and a database derived from a large-scale genomics survey. This substantially enhances understanding of phage specificity determinants at the strain level and offers a valuable framework for guiding phage selection in therapeutic applications.