Identifying genes targeted by disease-associated non-coding SNPs with a protein knowledge graph.

Affiliations

Department of Medical Informatics, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands.

Data Science, Life Science Operations Department, Elsevier B.V., Amsterdam, the Netherlands.

Publication information

PLoS One. 2022 Jul 13;17(7):e0271395. doi: 10.1371/journal.pone.0271395. eCollection 2022.

PMID:35830458

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9278741/

Abstract

Genome-wide association studies (GWAS) have identified many single nucleotide polymorphisms (SNPs) that play important roles in the genetic heritability of traits and diseases. Because most of these SNPs are located in the non-coding part of the genome, they are currently assumed to influence the expression of nearby genes. However, identifying which genes are targeted by these disease-associated SNPs remains challenging. Protein knowledge graphs have often been used to identify genes that are associated with disease, also referred to as "disease genes". Here, we explore whether protein knowledge graphs can also be used to identify genes that are targeted by disease-associated non-coding SNPs by testing and comparing the performance of six existing methods for a protein knowledge graph, four of which were developed for disease gene identification. We compare their performance against two baselines: (1) an existing state-of-the-art method that is based on guilt-by-association, and (2) the leading assumption that SNPs target the nearest gene on the genome. We test these methods with four reference sets, three of which were obtained by different means. Furthermore, we combine methods to investigate whether their combination improves performance. We find that protein knowledge graphs that include predicate information perform comparably to the current state of the art, achieving an area under the receiver operating characteristic curve (AUC) of 79.6% on average across all four reference sets. Protein knowledge graphs that lack predicate information perform comparably to our other baseline (genetic distance), which achieved an AUC of 75.7% across all four reference sets. Combining multiple methods improved performance to 84.9% AUC. We conclude that methods for a protein knowledge graph can be used to identify which genes are targeted by disease-associated non-coding SNPs.
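
The evaluation described in the abstract ranks candidate genes near a SNP and scores each ranking by AUC. The following is a minimal sketch of that idea, not the authors' implementation. The candidate genes, distances, and knowledge-graph scores are invented for illustration, and the rank-sum used to combine the two rankings is an assumption, since the abstract does not say how the methods were combined.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ranks(values):
    """Rank of each value, 0 for the lowest (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


# Hypothetical candidates around one SNP:
# (distance to the SNP in base pairs, knowledge-graph score, is true target)
candidates = [
    (1_000, 0.20, 0),
    (5_000, 0.90, 1),
    (20_000, 0.40, 0),
    (50_000, 0.70, 1),
    (80_000, 0.10, 0),
]
labels = [c[2] for c in candidates]

# Nearest-gene baseline: a smaller distance should give a higher score.
distance_score = [-c[0] for c in candidates]
# Stand-in for the output of a knowledge-graph method.
graph_score = [c[1] for c in candidates]
# Assumed combination: sum of the per-method ranks.
combined_score = [a + b for a, b in zip(ranks(distance_score), ranks(graph_score))]

for name, scores in [
    ("nearest gene", distance_score),
    ("knowledge graph", graph_score),
    ("rank-sum combination", combined_score),
]:
    print(f"{name}: AUC = {roc_auc(scores, labels):.2f}")
```

On real data, each method would be scored against each of the four reference sets, and the resulting AUC values would then be averaged.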

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b0b/9278741/1dea58c3b998/pone.0271395.g001.jpg
