Yang Yiyan, Dufault-Thompson Keith, Yan Wei, Cai Tian, Xie Lei, Jiang Xiaofang
National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.
Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, NY 10016, USA.
bioRxiv. 2023 Jun 16:2023.06.16.545366. doi: 10.1101/2023.06.16.545366.
Phage tailspike proteins are depolymerases that target diverse bacterial surface glycans with high specificity and thereby determine the host specificity of numerous phages. Because their sequence diversity makes tailspike proteins difficult to identify, we developed SpikeHunter, an approach based on the ESM-2 protein language model. Using SpikeHunter, we identified 231,965 tailspike proteins in a dataset of 8,434,494 prophages found within 165,365 genomes of five common pathogens. Of these, 143,035 tailspike proteins displayed strong associations with serotypes. Moreover, we observed highly similar tailspike proteins in species that share closely related serotypes. We found extensive domain swapping in all five species, with the C-terminal domain being significantly associated with host serotype, highlighting its role in host range determination. Our study presents a comprehensive cross-species analysis of tailspike protein-serotype associations, providing insights applicable to phage therapy and biotechnology.