Fine-Tuning Protein Language Models Unlocks the Potential of Underrepresented Viral Proteomes.

Authors

Sawhney Rajan, Ferrell Barbra, Dejean Thibaut, Schreiber Zachary, Harrigan William, Polson Shawn W, Wommack K Eric, Belcaid Mahdi

Affiliations

Department of Information and Computer Sciences, University of Hawai'i at Manoa.

Department of Computer and Information Sciences, University of Delaware.

Publication information

bioRxiv. 2025 Jun 11:2025.04.17.649224. doi: 10.1101/2025.04.17.649224.

DOI:10.1101/2025.04.17.649224

PMID:40661509

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12258895/

Abstract

Protein language models (pLMs) have revolutionized computational biology by generating rich protein vector representations, or embeddings, enabling major advancements in de novo protein design, structure prediction, variant effect analysis, and evolutionary studies. Despite these breakthroughs, current pLMs often exhibit biases against proteins from underrepresented species. Viral proteins are particularly affected: they are frequently referred to as the "dark matter" of the biological world because of their vast diversity and ubiquity, yet they are sparsely represented in training datasets. Here, we show that fine-tuning pre-trained pLMs on viral protein sequences, using diverse learning frameworks and parameter-efficient strategies, significantly enhances representation quality and improves performance on downstream tasks. To support further research, we provide source code for fine-tuning pLMs and benchmarking embedding quality. By enabling more accurate modeling of viral proteins, our approach advances tools for understanding viral biology, combating emerging infectious diseases, and driving biotechnological innovation.
