Departments of Bioengineering and Mechanical Engineering, Molecular Cell Biomechanics Laboratory, University of California, Berkeley, CA 94720, United States.
Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae445.
Proteins of unknown function are frequently compared to better-characterized relatives, either through sequence similarity or, more recently, through similarity in a learned embedding space. Such comparisons let protein sequence embeddings support interpretable and accurate annotation of proteins, as well as downstream tasks such as clustering for unsupervised discovery of protein families. However, it is unclear whether embeddings can be deliberately designed to improve their usefulness in these downstream tasks.
We find that for functional annotation of proteins, as represented by Gene Ontology (GO) terms, directly fine-tuning language models with a simple classification loss has an immediate positive impact on protein embedding quality. Fine-tuned embeddings perform better as representations for K-nearest neighbor classifiers, outperforming even directly comparable fine-tuned classifiers on GO annotation, while maintaining interpretability through protein similarity comparisons. They also maintain their quality in related tasks, such as rediscovering protein families with clustering.
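To illustrate how embeddings can serve as representations for a K-nearest neighbor annotator, the following is a minimal sketch, not the implementation in the linked repository. It assumes embeddings are already computed as fixed-length vectors and that GO annotations are given as a binary protein-by-term matrix. The function name `knn_go_scores`, the cosine-similarity weighting, and the toy data are all illustrative assumptions.

```python
import numpy as np

def knn_go_scores(train_emb, train_labels, query_emb, k=3):
    """Score GO terms for each query protein from its k nearest training proteins.

    Hypothetical helper: neighbors are ranked by cosine similarity, and each
    GO term score is the similarity-weighted fraction of neighbors carrying it.
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    train_n = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    query_n = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = query_n @ train_n.T  # shape: (n_query, n_train)

    # Indices and similarities of the k most similar training proteins.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    nn_sims = np.take_along_axis(sims, nn_idx, axis=1)

    # Non-negative weights that sum to 1 per query.
    weights = np.clip(nn_sims, 0.0, None)
    weights = weights / np.maximum(weights.sum(axis=1, keepdims=True), 1e-12)

    # Weighted average of the neighbors' binary GO label vectors.
    scores = np.einsum("qk,qkt->qt", weights, train_labels[nn_idx])
    return scores, nn_idx

# Toy data (assumed): four annotated proteins, two GO terms, one query.
train_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
query_emb = np.array([[1.0, 0.05]])

scores, nn_idx = knn_go_scores(train_emb, train_labels, query_emb, k=2)
```

Because `nn_idx` identifies which annotated proteins produced each score, a prediction can be traced back to specific similar proteins, which is the kind of interpretability through similarity comparison described above.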
github.com/mofradlab/go_metric.