HetIG-PreDiG: A Heterogeneous Integrated Graph Model for Predicting Human Disease Genes based on gene expression.

Author affiliations

The School of Business Administration, Bar-Ilan University, Ramat Gan, Israel.

Department of Psychiatry, Harvard Medical School, Boston, MA, United States of America.

Publication information

PLoS One. 2023 Feb 15;18(2):e0280839. doi: 10.1371/journal.pone.0280839. eCollection 2023.

DOI:10.1371/journal.pone.0280839

PMID:36791052

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9931161/

Abstract

Graph analytical approaches permit identifying novel genes involved in complex diseases, but are limited by (i) inferring structural network similarity of connected gene nodes, ignoring potentially relevant unconnected nodes; (ii) using homogeneous graphs, missing gene-disease associations' complexity; (iii) relying on disease/gene-phenotype associations' similarities, involving highly incomplete data; (iv) using binary classification, with gene-disease edges as positive training samples, and non-associated gene and disease nodes as negative samples that may include currently unknown disease genes; or (v) reporting predicted novel associations without systematically evaluating their accuracy. Addressing these limitations, we develop the Heterogeneous Integrated Graph for Predicting Disease Genes (HetIG-PreDiG) model that includes gene-gene, gene-disease, and gene-tissue associations. We predict novel disease genes using low-dimensional representation of nodes accounting for network structure, and extending beyond network structure using the developed Gene-Disease Prioritization Score (GDPS) reflecting the degree of gene-disease association via gene co-expression data. For negative training samples, we select non-associated gene and disease nodes with lower GDPS that are less likely to be affiliated. We evaluate the developed model's success in predicting novel disease genes by analyzing the prediction probabilities of gene-disease associations. HetIG-PreDiG successfully predicts (Micro-F1 = 0.95) gene-disease associations, outperforming baseline models, and is validated using published literature, thus advancing our understanding of complex genetic diseases.
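Two ideas in the abstract can be illustrated with a minimal sketch: choosing negative training pairs from the non-associated gene-disease pairs with the lowest prioritization score, and computing a micro-averaged F1. The abstract does not say how GDPS is calculated, so the scores below are made-up placeholders, and the names `select_negatives` and `micro_f1` are hypothetical helpers, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (gene, disease); identifiers below are toy values


def select_negatives(scores: Dict[Pair, float], known: Set[Pair], k: int) -> List[Pair]:
    """Return the k non-associated pairs with the lowest score.

    `scores` stands in for a GDPS-like value per pair (assumed, not derived
    from co-expression data here). Known associations are never selected.
    """
    candidates = [pair for pair in scores if pair not in known]
    # Sort by score, then by pair, so ties are broken deterministically.
    candidates.sort(key=lambda pair: (scores[pair], pair))
    return candidates[:k]


def micro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: one known association and three unlabeled pairs.
toy_scores = {
    ("G1", "D1"): 0.90,
    ("G2", "D1"): 0.10,
    ("G3", "D1"): 0.40,
    ("G4", "D1"): 0.05,
}
toy_known = {("G1", "D1")}
negatives = select_negatives(toy_scores, toy_known, k=2)

f1 = micro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In the single-label setting used in this toy example, micro-F1 coincides with plain accuracy, because every misclassified sample counts once as a false positive and once as a false negative.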
