Fine-Tuning Protein Language Models Unlocks the Potential of Underrepresented Viral Proteomes.

Authors

Sawhney Rajan, Ferrell Barbra, Dejean Thibaut, Schreiber Zachary, Harrigan William, Polson Shawn W, Wommack K Eric, Belcaid Mahdi

Affiliations

Department of Information and Computer Sciences, University of Hawai'i at Manoa.

Department of Computer and Information Sciences, University of Delaware.

Publication information

bioRxiv. 2025 Jun 11:2025.04.17.649224. doi: 10.1101/2025.04.17.649224.


DOI:10.1101/2025.04.17.649224
PMID:40661509
Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12258895/
Abstract

Protein language models (pLMs) have revolutionized computational biology by generating rich protein vector representations, or embeddings, enabling major advancements in de novo protein design, structure prediction, variant effect analysis, and evolutionary studies. Despite these breakthroughs, current pLMs often exhibit biases against proteins from underrepresented species. Viral proteins are particularly affected: they are frequently referred to as the "dark matter" of the biological world because of their vast diversity and ubiquity, yet they are sparsely represented in training datasets. Here, we show that fine-tuning pre-trained pLMs on viral protein sequences, using diverse learning frameworks and parameter-efficient strategies, significantly enhances representation quality and improves performance on downstream tasks. To support further research, we provide source code for fine-tuning pLMs and benchmarking embedding quality. By enabling more accurate modeling of viral proteins, our approach advances tools for understanding viral biology, combating emerging infectious diseases, and driving biotechnological innovation.
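
The abstract does not say which parameter-efficient strategies were used. As a rough illustration of the general idea only, the sketch below assumes a LoRA-style low-rank adapter, one common technique of this kind: the pre-trained weight matrix W stays frozen and only two small matrices A and B are trained, giving an effective weight W + (alpha / r) * B A. The function names, matrix sizes, and the choice of LoRA itself are all assumptions made for illustration, not the authors' released code.

```python
# Illustrative sketch only: LoRA-style low-rank adaptation is an ASSUMED
# example of a "parameter-efficient strategy"; the abstract does not name one.
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)            # A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(b, a)  # B has shape (d_out, r), so delta is (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Count trainable parameters: full fine-tuning versus a rank-r adapter."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}


random.seed(0)
d_out, d_in, rank = 8, 8, 2  # hypothetical toy sizes, far smaller than a real pLM layer

# Frozen pre-trained weight (random stand-in values).
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable adapter factors: A gets small random values, B starts at zero,
# so the adapter is a no-op before any training step.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_eff = lora_effective_weight(w, a, b, alpha=4.0)
counts = trainable_params(d_out, d_in, rank)
```

With B initialized to zero, the adapted layer reproduces the pre-trained layer exactly before training, which is why this kind of adapter can be attached to an existing model without disturbing its starting behavior. At realistic layer sizes the rank is much smaller than the layer dimensions, so the adapter trains a small fraction of the parameters that full fine-tuning would.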

