Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), China.
Brief Bioinform. 2023 Sep 22;24(6). doi: 10.1093/bib/bbad408.
Bacteriophages (phages for short), which prey on and replicate within bacterial cells, play a significant role in modulating microbial communities and have potential applications in combating antibiotic resistance. Advances in high-throughput sequencing technology have greatly accelerated the discovery of phages. However, the taxonomic classification of assembled phage contigs still faces several challenges, including high genetic diversity, the lack of a stable taxonomy system and limited knowledge of phage annotations. Despite extensive efforts, existing tools have not yet achieved an optimal balance between prediction rate and accuracy.
In this work, we develop a learning-based model named PhaGenus, which performs genus-level taxonomic classification of phage contigs. PhaGenus uses a Transformer model to learn the associations between protein clusters and supports the classification of up to 508 genera. We tested PhaGenus on four datasets in different scenarios. The experimental results show that PhaGenus outperforms state-of-the-art methods on low-similarity datasets, achieving an improvement of at least 13.7%. Additionally, PhaGenus is highly effective at identifying previously uncharacterized genera that are not represented in reference databases, with an improvement of 8.52%. Analyses of the infant gut and GOV2.0 datasets demonstrate that PhaGenus can classify more contigs with higher accuracy.
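The trade-off between prediction rate and accuracy mentioned above can be illustrated with a minimal sketch. The abstract does not define these metrics, so this assumes that prediction rate is the fraction of contigs that receive a genus label and accuracy is the fraction of labeled contigs whose label is correct. The function name `rate_and_accuracy`, the confidence threshold, and the toy genus labels are all hypothetical and are not taken from PhaGenus.

```python
from typing import List, Optional, Tuple


def rate_and_accuracy(
    predictions: List[Tuple[str, float]],
    truth: List[str],
    threshold: float,
) -> Tuple[float, Optional[float]]:
    """Return (prediction rate, accuracy) at a confidence threshold.

    predictions: one (predicted genus, confidence score) pair per contig.
    truth: the true genus of each contig, in the same order.
    Assumed definitions (not from the paper):
      prediction rate = labeled contigs / all contigs
      accuracy        = correctly labeled contigs / labeled contigs
    """
    if len(predictions) != len(truth):
        raise ValueError("predictions and truth must have the same length")
    if not predictions:
        raise ValueError("at least one contig is required")

    # Keep only contigs whose prediction is confident enough to be reported.
    kept = [
        (genus, true_genus)
        for (genus, score), true_genus in zip(predictions, truth)
        if score >= threshold
    ]
    rate = len(kept) / len(predictions)
    if not kept:
        # Accuracy is undefined when nothing is labeled.
        return rate, None
    correct = sum(1 for genus, true_genus in kept if genus == true_genus)
    return rate, correct / len(kept)


if __name__ == "__main__":
    # Toy data with placeholder genus names, for illustration only.
    toy_predictions = [("A", 0.95), ("B", 0.80), ("A", 0.55), ("C", 0.40)]
    toy_truth = ["A", "B", "C", "C"]
    for t in (0.0, 0.5, 0.9):
        print(t, rate_and_accuracy(toy_predictions, toy_truth, t))
```

On this toy data, raising the threshold labels fewer contigs but changes the accuracy among those labeled, which is the balance a classifier has to strike.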