Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA.
Department of Biological Sciences, Wellesley College, Wellesley, MA, USA.
Nat Microbiol. 2024 Feb;9(2):537-549. doi: 10.1038/s41564-023-01584-8. Epub 2024 Jan 29.
Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.