Fine-tuning protein embeddings for functional similarity evaluation.

Affiliations

Departments of Bioengineering and Mechanical Engineering, Molecular Cell Biomechanics Laboratory, University of California, Berkeley, CA 94720, United States.

Publication information

Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae445.

Abstract

MOTIVATION

Proteins of unknown function are frequently compared to better-characterized relatives, either by sequence similarity or, more recently, by similarity in a learned embedding space. Through such comparisons, protein sequence embeddings allow interpretable and accurate annotation of proteins, and they support downstream tasks such as clustering for unsupervised discovery of protein families. However, it is unclear whether embeddings can be deliberately designed to improve their usefulness in these downstream tasks.

RESULTS

We find that for functional annotation of proteins, as represented by Gene Ontology (GO) terms, directly fine-tuning language models with a simple classification loss has an immediate positive impact on protein embedding quality. Fine-tuned embeddings perform better as representations for K-nearest neighbor classifiers, outperforming even directly comparable fine-tuned classifiers on GO annotation, while maintaining interpretability through protein similarity comparisons. They also maintain their quality in related tasks, such as rediscovering protein families through clustering.
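To make the K-nearest neighbor idea concrete, here is a minimal sketch of similarity-based GO annotation over precomputed embeddings. It is not the authors' implementation: the function name `knn_go_scores`, the cosine metric, the similarity-weighted voting, and the toy two-dimensional data are all assumptions chosen for illustration.

```python
import numpy as np

def knn_go_scores(query_emb, ref_embs, ref_labels, k=3):
    """Score GO terms for one query protein from its k nearest annotated neighbors.

    query_emb:  (d,) embedding of the unannotated protein
    ref_embs:   (n, d) embeddings of annotated reference proteins
    ref_labels: (n, t) binary matrix, 1.0 where protein i carries GO term j
    Returns per-term scores of shape (t,) and the indices of the neighbors used.
    """
    # Cosine similarity is an assumption here; other metrics could be substituted.
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = r @ q

    # Indices of the k most similar reference proteins.
    neighbors = np.argsort(-sims)[:k]

    # Similarity-weighted vote over the neighbors' GO annotations
    # (negative similarities are clipped to zero weight).
    weights = np.clip(sims[neighbors], 0.0, None)
    scores = weights @ ref_labels[neighbors] / max(weights.sum(), 1e-12)
    return scores, neighbors

# Toy data (hypothetical): four reference proteins, two GO terms.
ref_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ref_labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
query_emb = np.array([1.0, 0.05])

scores, neighbors = knn_go_scores(query_emb, ref_embs, ref_labels, k=2)
print(scores, neighbors)
```

Because the function returns the neighbor indices alongside the scores, each predicted term can be traced back to the specific annotated proteins that supported it, which is the kind of interpretability through similarity comparison that the abstract describes.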

AVAILABILITY AND IMPLEMENTATION

github.com/mofradlab/go_metric.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d2eb/11299545/2f7d8f425f88/btae445f1.jpg
