Ibtehaz Nabil, Kagaya Yuki, Kihara Daisuke
Department of Computer Science, Purdue University, West Lafayette, IN, United States.
Department of Biological Sciences, Purdue University, West Lafayette, IN, United States.
bioRxiv. 2023 Aug 24:2023.08.23.554486. doi: 10.1101/2023.08.23.554486.
Domains are the functional and structural units of proteins and govern the biological functions that proteins perform. Characterizing the domains in a protein can therefore serve as a functional representation of that protein. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The resulting domain embeddings proved effective in actual function prediction tasks. Extensive evaluations showed that protein representations built from the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, significantly outperformed state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving the best overall performance among the top teams that participated in the assessment.
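To make the notion of domain-GO co-occurrence concrete, the following is a minimal sketch on toy data. It is not the Domain-PFP model: the abstract describes learned embeddings, whereas this sketch substitutes simple count-based conditional frequencies and represents a protein by averaging over its domains. The annotation table, the identifiers (`domA`, `go1`, and so on), and both function names are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

# Toy annotations (hypothetical): protein -> (set of domains, set of GO terms).
ANNOTATIONS = {
    "P1": ({"domA", "domB"}, {"go1", "go2"}),
    "P2": ({"domA"}, {"go1"}),
    "P3": ({"domB", "domC"}, {"go2", "go3"}),
}


def cooccurrence_probabilities(annotations):
    """Estimate P(GO term | domain) from how often they annotate the same protein."""
    domain_counts = Counter()
    pair_counts = defaultdict(Counter)
    for domains, go_terms in annotations.values():
        for domain in domains:
            domain_counts[domain] += 1
            for go_term in go_terms:
                pair_counts[domain][go_term] += 1
    return {
        domain: {go: n / domain_counts[domain] for go, n in pair_counts[domain].items()}
        for domain in domain_counts
    }


def protein_representation(domains, probs, go_vocab):
    """Average the per-domain GO profiles (an assumed, simplistic pooling choice)."""
    known = [d for d in domains if d in probs]
    if not known:
        return [0.0] * len(go_vocab)
    return [sum(probs[d].get(go, 0.0) for d in known) / len(known) for go in go_vocab]


PROBS = cooccurrence_probabilities(ANNOTATIONS)
GO_VOCAB = sorted({go for _, gos in ANNOTATIONS.values() for go in gos})

if __name__ == "__main__":
    # Profile of a protein carrying domA and domB, ordered as in GO_VOCAB.
    print(GO_VOCAB)
    print(protein_representation({"domA", "domB"}, PROBS, GO_VOCAB))
```

In this toy setting each domain's "representation" is simply its row of conditional frequencies over the GO vocabulary. A learned embedding, as described in the abstract, would replace these sparse rows with dense vectors trained to capture the same associations.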