Department of Medical Informatics, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands.
Data Science, Life Science Operations Department, Elsevier B.V., Amsterdam, the Netherlands.
PLoS One. 2022 Jul 13;17(7):e0271395. doi: 10.1371/journal.pone.0271395. eCollection 2022.
Genome-wide association studies (GWAS) have identified many single nucleotide polymorphisms (SNPs) that play important roles in the genetic heritability of traits and diseases. Because most of these SNPs are located in the non-coding part of the genome, it is currently assumed that they influence the expression of nearby genes. However, identifying which genes are targeted by these disease-associated SNPs remains challenging. In the past, protein knowledge graphs have often been used to identify genes that are associated with disease, also referred to as "disease genes". Here, we explore whether protein knowledge graphs can be used to identify genes that are targeted by disease-associated non-coding SNPs by testing and comparing the performance of six existing methods for a protein knowledge graph, four of which were developed for disease gene identification. We compare their performance against two baselines: (1) an existing state-of-the-art method that is based on guilt-by-association, and (2) the leading assumption that SNPs target the nearest gene on the genome. We test these methods with four reference sets, three of which were obtained by different means. Furthermore, we combine methods to investigate whether their combination improves performance. We find that protein knowledge graphs that include predicate information perform comparably to the current state of the art, achieving an area under the receiver operating characteristic curve (AUC) of 79.6% on average across all four reference sets. Protein knowledge graphs that lack predicate information perform comparably to our other baseline (genetic distance), which achieved an AUC of 75.7% across all four reference sets. Combining multiple methods improved performance to 84.9% AUC. We conclude that methods for a protein knowledge graph can be used to identify which genes are targeted by disease-associated non-coding SNPs.
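To make the nearest-gene baseline and the AUC metric concrete, the following is a minimal sketch, not the implementation used in the study. It assumes each candidate gene is represented by a single genomic coordinate and scores genes by their distance to the SNP; the gene names, positions, target set, and the function names `distance_scores` and `roc_auc` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set


def distance_scores(snp_pos: int, gene_pos: Dict[str, int]) -> Dict[str, float]:
    """Score each candidate gene by negative distance to the SNP (closer = higher)."""
    return {gene: -float(abs(pos - snp_pos)) for gene, pos in gene_pos.items()}


def roc_auc(scores: Dict[str, float], positives: Set[str]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the positive
    gene scores higher than the negative gene; ties count as half a win."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one SNP and four candidate genes on the same chromosome.
snp_pos = 1000
gene_pos = {"GENE_A": 950, "GENE_B": 1400, "GENE_C": 2000, "GENE_D": 100}
targets = {"GENE_A", "GENE_D"}  # assumed reference set of true target genes

scores = distance_scores(snp_pos, gene_pos)
auc = roc_auc(scores, targets)
print(f"nearest gene: {max(scores, key=scores.get)}, AUC: {auc:.2f}")
```

In this toy case the nearest gene (GENE_A) is a true target, but the second target (GENE_D) lies farther from the SNP than one non-target, so the distance ranking places three of the four target/non-target pairs in the right order and the AUC is 0.75. Any other scoring method, such as one derived from a protein knowledge graph, could be evaluated by passing its scores to the same `roc_auc` function.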