The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction.

Authors

van den Bent Irene, Makrodimitris Stavros, Reinders Marcel

Affiliations

Delft Bioinformatics Lab, Delft University of Technology, Delft, the Netherlands.

Keygene N.V., Wageningen, the Netherlands.

Publication information

Evol Bioinform Online. 2021 Dec 3;17:11769343211062608. doi: 10.1177/11769343211062608. eCollection 2021.

DOI:10.1177/11769343211062608

PMID:34880594

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8647222/

Abstract

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
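
The cross-species transfer setup described in the abstract can be illustrated with a toy sketch: fit a function predictor on the embeddings of one training species, then score it on other species. Everything concrete here is an assumption made for illustration. The abstract names neither the embedding model nor the classifier, so random synthetic vectors stand in for pre-trained embeddings, a plain logistic regression stands in for the function predictor, a single binary label stands in for a molecular function term, and larger noise stands in for greater evolutionary distance.

```python
import math
import random

DIM = 8  # toy embedding size (assumption); real protein embeddings are far larger


def make_species(rng, n_proteins, noise):
    """Return (embeddings, labels) for one synthetic 'species'.

    Each protein is a noisy copy of a prototype vector whose sign encodes
    whether it has the (hypothetical) function. Not real biological data.
    """
    prototype = [1.0] * (DIM // 2) + [-1.0] * (DIM // 2)
    X, y = [], []
    for _ in range(n_proteins):
        label = rng.randrange(2)
        sign = 1.0 if label == 1 else -1.0
        X.append([sign * p + rng.gauss(0.0, noise) for p in prototype])
        y.append(label)
    return X, y


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Predicted probability that protein embedding x has the function."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, epochs=200, lr=0.1):
    """Fit logistic regression with full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            err = score(w, b, x) - label
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    hits = sum((score(w, b, x) >= 0.5) == (label == 1) for x, label in zip(X, y))
    return hits / len(y)


rng = random.Random(0)
train_X, train_y = make_species(rng, 200, noise=0.5)  # annotated training species
near_X, near_y = make_species(rng, 100, noise=0.5)    # stand-in for a close test species
far_X, far_y = make_species(rng, 100, noise=0.8)      # stand-in for a distant test species

# Train once on the training species, then evaluate on the others unchanged.
w, b = train_logreg(train_X, train_y)
near_acc = accuracy(w, b, near_X, near_y)
far_acc = accuracy(w, b, far_X, far_y)
print(f"near species accuracy: {near_acc:.2f}")
print(f"far species accuracy:  {far_acc:.2f}")
```

The point of the sketch is the protocol rather than the numbers: the predictor never sees labels from the test species, so any accuracy there comes from structure shared through the embedding space.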
