Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, United States of America.
Institute of Bioinformatics, University of Georgia, Athens, GA, United States of America.
PeerJ. 2023 Oct 18;11:e15815. doi: 10.7717/peerj.15815. eCollection 2023.
The 534 protein kinases encoded in the human genome constitute a large druggable class of proteins that includes both well-studied and understudied "dark" members. Accurate prediction of dark kinase functions is a major bioinformatics challenge. Here, we employ a graph mining approach that uses the evolutionary and functional context encoded in knowledge graphs (KGs) to predict protein and pathway associations for understudied kinases. We propose a new scalable graph embedding approach, RegPattern2Vec, which employs regular pattern constrained random walks to flexibly sample diverse aspects of node context within a KG. RegPattern2Vec learns functional representations of kinases, interacting partners, post-translational modifications, pathways, cellular localization, and chemical interactions from a kinase-centric KG that integrates and conceptualizes data from curated heterogeneous data resources. By contextualizing information relevant to prediction, RegPattern2Vec improves accuracy and efficiency in comparison to other random walk-based graph embedding approaches. We show that the predictions produced by our model overlap with pathway enrichment data produced using experimentally validated protein-protein interaction (PPI) data from both publicly available databases and experimental datasets not used in training. Our model also has the advantage of using the collected random walks as biological context to interpret the predicted protein-pathway associations. We provide high-confidence pathway predictions for 34 dark kinases and present three case studies in which analysis of meta-paths associated with the prediction enables biological interpretation. Overall, RegPattern2Vec efficiently samples multiple node types for link prediction on biological knowledge graphs, and the predicted associations between understudied kinases, pseudokinases, and known pathways serve as a conceptual starting point for hypothesis generation and testing.
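The idea of a regular pattern constrained random walk can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph, the node names, the pattern (kinase, one or more proteins, then a pathway), and the function `constrained_walk` are all hypothetical, and the pattern is assumed here to be expressed over node types and encoded by hand as a small finite automaton. The walker only steps to neighbors whose type keeps the automaton in a valid state, so every returned walk matches the pattern.

```python
import random

# Hypothetical toy knowledge graph (illustrative names only, not from the paper).
NODE_TYPE = {
    "PKDARK": "kinase",
    "PK_A": "kinase",
    "SUB1": "protein",
    "SUB2": "protein",
    "PATH_X": "pathway",
    "LOC_N": "localization",
}
EDGES = [
    ("PKDARK", "SUB1"), ("PK_A", "SUB1"), ("PK_A", "SUB2"),
    ("SUB1", "SUB2"), ("SUB1", "PATH_X"), ("SUB2", "PATH_X"),
    ("PKDARK", "LOC_N"),
]

# Build an undirected adjacency list.
ADJ = {}
for u, v in EDGES:
    ADJ.setdefault(u, []).append(v)
    ADJ.setdefault(v, []).append(u)

# Assumed pattern over node types: kinase protein+ pathway,
# written as a deterministic finite automaton (state -> {type: next state}).
START, ACCEPT = "q0", "q3"
DFA = {
    "q0": {"kinase": "q1"},
    "q1": {"protein": "q2"},
    "q2": {"protein": "q2", "pathway": "q3"},
}


def constrained_walk(start, rng, max_len=6):
    """Return one walk from `start` that matches the pattern, or None."""
    state = DFA[START].get(NODE_TYPE[start])
    if state is None:
        return None  # the start node's type is not allowed by the pattern
    walk = [start]
    while state != ACCEPT and len(walk) < max_len:
        allowed = DFA.get(state, {})
        # Keep only neighbors whose type the automaton accepts in this state.
        candidates = [n for n in ADJ.get(walk[-1], []) if NODE_TYPE[n] in allowed]
        if not candidates:
            return None  # dead end: no neighbor continues the pattern
        nxt = rng.choice(candidates)
        state = allowed[NODE_TYPE[nxt]]
        walk.append(nxt)
    return walk if state == ACCEPT else None


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(constrained_walk("PKDARK", rng))
```

In random walk-based embedding methods generally, walks collected this way are treated as sentences for a skip-gram style model; the walks themselves can also be read back as paths that link a kinase to a pathway, which is the kind of interpretability the abstract describes.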