Extending Protein Language Models to a Viral Genomic Scale Using Biologically Induced Sparse Attention

Authors

Dejean Thibaut, Ferrell Barbra D, Harrigan William, Schreiber Zachary D, Sawhney Rajan, Wommack K Eric, Polson Shawn W, Belcaid Mahdi

Affiliations

Department of Information and Computer Sciences, University of Hawaii, Honolulu, HI.

Department of Computer & Information Sciences, Delaware Biotechnology Institute, University of Delaware, Newark, DE.

Publication

bioRxiv. 2025 Jun 11:2025.05.29.656907. doi: 10.1101/2025.05.29.656907.

Abstract

The transformer architecture in deep learning has revolutionized protein sequence analysis. Recent advances in protein language models have enabled significant progress across various domains, including protein function and structure prediction, multiple sequence alignment, and mutation effect prediction. A protein language model is commonly trained on individual proteins, ignoring the interdependencies between sequences within a genome. However, protein-protein interactions span entire genomic regions, underscoring the limitations of focusing solely on individual proteins. To address these limitations, we propose a novel approach that extends the context size of transformer models across the entire viral genome. By training on large genomic fragments, our method captures long-range interprotein interactions and encodes each protein sequence with integrated information from distant proteins within the same genome, offering substantial benefits across various tasks. Viruses, with their densely packed genomes, minimal intergenic regions, and protein annotation challenges, are ideal candidates for genome-wide learning. We introduce a long-context protein language model, trained on entire viral genomes, that leverages a sparse attention mechanism based on protein-protein interactions. Our semi-supervised approach supports sequences of up to 61,000 amino acids (aa). Our evaluations demonstrate that the resulting embeddings significantly surpass those generated by single-protein models and outperform alternative large-context architectures that rely on static masking or non-transformer frameworks.
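The abstract describes the core mechanism only at a high level: attention that is dense within a protein but permitted between proteins only where a protein-protein interaction exists. Below is a minimal illustrative sketch of how such a biologically induced attention mask could be constructed; it is not the authors' implementation, and the function name, the `boundaries`/`interactions` inputs, and the example spans are all hypothetical.

```python
# Minimal sketch (not the authors' code) of a PPI-induced sparse attention
# mask: residues attend densely within their own protein; cross-protein
# attention is enabled only for protein pairs flagged as interacting.
import torch

def build_ppi_attention_mask(boundaries, interactions, seq_len):
    """boundaries: list of (start, end) residue spans, one per protein.
    interactions: set of (i, j) protein-index pairs allowed to attend.
    Returns a bool mask where True marks a permitted attention edge."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Dense intra-protein attention.
    for start, end in boundaries:
        mask[start:end, start:end] = True
    # Sparse inter-protein attention, gated by the interaction list.
    for i, j in interactions:
        si, ei = boundaries[i]
        sj, ej = boundaries[j]
        mask[si:ei, sj:ej] = True
        mask[sj:ej, si:ei] = True
    return mask

# Example: a 3-protein genome fragment where only proteins 0 and 2 interact.
boundaries = [(0, 120), (120, 300), (300, 450)]
mask = build_ppi_attention_mask(boundaries, {(0, 2)}, seq_len=450)
# A boolean mask in this form can be passed as attn_mask to
# torch.nn.functional.scaled_dot_product_attention, where True means
# "this position may be attended to".
```

Note that materializing a dense 61,000 × 61,000 boolean mask would be prohibitive at the context lengths the paper reports; a practical long-context model would realize the same pattern with block-sparse attention kernels. The sketch is only meant to convey the masking logic.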


Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a6d2/12169145/02ecba0cbf5f/nihpp-2025.05.29.656907v2-f0002.jpg
