X-LDA: An interpretable and knowledge-informed heterogeneous graph learning framework for LncRNA-disease association prediction.

Author information

Cao Yangkun, Xiao Jun, Sheng Nan, Qu Yinwei, Wang Zhihang, Sun Chang, Mu Xuechen, Huang Zhenyu, Li Xuan

Affiliations

School of Artificial Intelligence, Jilin University, Changchun, 130012, China.

College of Computer Science and Technology, Jilin University, Changchun, 130012, China.

Publication information

Comput Biol Med. 2023 Oct 27;167:107634. doi: 10.1016/j.compbiomed.2023.107634.

DOI:10.1016/j.compbiomed.2023.107634

PMID:39491920

Abstract

Identifying disease-related long noncoding RNAs (lncRNAs) helps unravel the intricacies of gene expression regulation and epigenetic signatures. Computational methods provide a cost-effective means to explore lncRNA-disease associations (LDAs). However, these methods often lack interpretability, leaving their predictions less convincing to biological and medical researchers. We propose X-LDA, an interpretable and knowledge-informed heterogeneous graph learning framework based on graph patch convolution and integrated gradients, which predicts LDAs and provides intuitive explanations for its predictions. Because the heterogeneous graph is the foundation of LDA prediction, we construct a knowledge-informed heterogeneous graph that includes LDAs drawn from biological experiments, lncRNA similarities rooted in gene sequences, and disease similarities based on disease categorizations. To integrate diverse biological premises and facilitate interpretability, we define nine distinct graph patch types, which encapsulate essential topological relationships within lncRNA-disease node pairs. X-LDA employs parameter sharing and multiple convolution kernels to capture the common and the multiple perspectives of the graph patches, respectively, and fuses the resulting semantic information into context embeddings. Its post-hoc explanations hinge on graph patch features and integrated gradients, shedding light on the underlying factors driving predictions. A cross-validation experiment on a dataset curated from databases and the literature demonstrates the superior performance of X-LDA in comparison to nine state-of-the-art methods from three categories. X-LDA achieves a larger average area under the receiver operating characteristic curve, 0.9891 (higher by at least 6.68%), and a larger average area under the precision-recall curve, 0.7907 (higher by at least 23.2%), than the competing methods. The results of our ablation and interpretability experiments and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis demonstrate X-LDA's robustness, learnability, predictability, and interpretability. The applicability of X-LDA is also demonstrated through a case study investigating lncRNAs associated with prostate cancer, colorectal cancer, and breast cancer.
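
The abstract names integrated gradients as the basis of the post-hoc explanations but does not spell the technique out. The sketch below is an illustration of the general method only, not X-LDA's implementation: it assumes a toy logistic scorer in place of the graph patch network, and the weights, bias, feature values, and all-zero baseline are hypothetical numbers chosen for the example. Each attribution is the feature's distance from the baseline multiplied by the average gradient along the straight path from baseline to input, approximated here with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy scorer (assumption): logistic function of a weighted feature sum."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy scorer with respect to the input x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [x0 + alpha * (xi - x0) for xi, x0 in zip(x, baseline)]
        grad = model_grad(point, w, b)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    # Scale the path-averaged gradient by each feature's distance from baseline.
    return [(xi - x0) * a for xi, x0, a in zip(x, baseline, avg_grad)]


# Hypothetical values standing in for the features of one lncRNA-disease pair.
weights = [1.5, -2.0, 0.5]
bias = 0.1
features = [0.8, 0.3, 0.9]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(features, baseline, weights, bias)
gap = model(features, weights, bias) - model(baseline, weights, bias)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `gap`, the difference between the score at the input and the score at the baseline. In this toy setting a positive attribution marks a feature that pushes the score up relative to the baseline, and a negative one marks a feature that pushes it down.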
