Large language models improve annotation of prokaryotic viral proteins.

Affiliations

Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA.

Department of Biological Sciences, Wellesley College, Wellesley, MA, USA.

Publication information

Nat Microbiol. 2024 Feb;9(2):537-549. doi: 10.1038/s41564-023-01584-8. Epub 2024 Jan 29.

Abstract

Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.
