Guan Jiaojiao, Ji Yongxin, Peng Cheng, Zou Wei, Tang Xubo, Shang Jiayu, Sun Yanni
Department of Electrical Engineering, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong (SAR), China.
Department of Information Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong (SAR), China.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf014.
Bacteriophages (phages) are viruses that infect bacteria and play a crucial role in microbial ecology. Phage proteins are central to understanding phage biology, including infection, replication, and evolution. Although metagenomic sequencing has identified a large number of new phages, many of them have limited protein function annotation. Accurate function annotation of phage proteins is challenging because the proteins are inherently diverse and few of them are annotated. Existing tools have yet to fully exploit the unique properties of phages when annotating protein functions. In this work, we propose GOPhage, a protein function annotation tool for phages that leverages the modular structure of phage genomes. By combining embeddings from the latest protein foundation models with a Transformer that captures contextual information between proteins in phage genomes, GOPhage outperforms state-of-the-art methods in annotating diverged proteins and proteins with uncommon functions, with improvements of 6.78% and 13.05%, respectively. GOPhage can also annotate proteins that have no homology search results, which is critical for characterizing the rapidly accumulating phage genomes. We demonstrate the utility of GOPhage by identifying 688 potential holins in phages, which exhibit high structural conservation with known holins. These results show the potential of GOPhage to extend our understanding of newly discovered phages.
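The idea of using a Transformer to capture context between proteins in a genome can be illustrated with a minimal sketch of its core operation, scaled dot-product self-attention. This is not the GOPhage implementation: the function names, the toy 4-dimensional vectors, and the use of identity projections (no learned query/key/value weights, no positional encoding, a single head) are all assumptions made for illustration. In practice the per-protein vectors would come from a protein foundation model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the raw embeddings
    (identity projections). A trained Transformer would apply learned
    linear maps and stack several such layers.
    """
    dim = len(embeddings[0])
    scale = math.sqrt(dim)
    contextual = []
    for query in embeddings:
        # Similarity of this protein to every protein on the same genome.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in embeddings
        ]
        weights = softmax(scores)
        # Each output is a weighted mixture of all protein embeddings.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, embeddings))
            for i in range(dim)
        ]
        contextual.append(mixed)
    return contextual


# Hypothetical toy genome: three proteins in genomic order, each a 4-d vector.
genome = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
contextual = self_attention(genome)
```

Each row of `contextual` blends a protein's own embedding with those of the other proteins on the same genome, weighted by similarity. A function classifier placed on top of such context-aware vectors could then draw on neighbouring proteins even when the protein itself has no detectable homolog.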