Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, USA.
Department of Microbiology and Immunology, Albert Einstein College of Medicine, Bronx, New York, USA.
mBio. 2024 Oct 16;15(10):e0320623. doi: 10.1128/mbio.03206-23. Epub 2024 Sep 4.
Viruses of bacteria, "phages," are fundamental, poorly understood components of microbial community structure and function. Additionally, their dependence on hosts for replication positions phages as unique sensors of ecosystem features and environmental pressures. High-throughput sequencing approaches have begun to give us access to the diversity and range of phage populations in complex microbial community samples, and metagenomics is currently the primary tool with which we study phage populations. The study of phages by metagenomic sequencing, however, is fundamentally limited by viral diversity, which results in the vast majority of viral genomes and metagenome-annotated genomes lacking annotation. To harness bacteriophages for applications in human and environmental health and disease, we need new methods to organize and annotate viral sequence diversity. We recently demonstrated that methods that leverage self-supervised representation learning can supplement statistical sequence representations for remote viral protein homology detection in the ocean virome and propose that consideration of the functional content of viral sequences allows for the identification of similarity in otherwise sequence-diverse viruses and viral-like elements for biological discovery. In this review, we describe the potential and pitfalls of large language models for viral annotation. We describe the need for new approaches to annotate viral sequences in metagenomes, the fundamentals of what protein language models are and how one can use them for sequence annotation, the strengths and weaknesses of these models, and future directions toward developing better models for viral annotation more broadly.