Protein Set Transformer: A protein-based genome language model to power high diversity viromics.

Author information

Martin Cody, Gitter Anthony, Anantharaman Karthik

Affiliations

Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA.

Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, USA.

Publication information

Res Sq. 2024 Sep 23:rs.3.rs-4844047. doi: 10.21203/rs.3.rs-4844047/v1.

Abstract

Exponential increases in microbial and viral genomic data demand transformational advances in scalable, generalizable frameworks for their interpretation. Standard homology-based functional analyses are hindered by the rapid divergence of microbial and especially viral genomes and proteins that significantly decreases the volume of usable data. Here, we present Protein Set Transformer (PST), a protein-based genome language model that models genomes as sets of proteins without considering sparsely available functional labels. Trained on >100k viruses, PST outperformed other homology- and language model-based approaches for relating viral genomes based on shared protein content. Further, PST demonstrated protein structural and functional awareness by clustering capsid-fold-containing proteins with known capsid proteins and uniquely clustering late gene proteins within related viruses. Our data establish PST as a valuable method for diverse viral genomics, ecology, and evolutionary applications. We posit that the PST framework can be a foundation model for microbial genomics when trained on suitable data.

