College of Software, Nankai University, Tianjin, 300071, China.
Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin, 300222, China.
Interdiscip Sci. 2018 Sep;10(3):572-582. doi: 10.1007/s12539-018-0296-1. Epub 2018 Apr 24.
Gene-phenotype association prediction can help reveal the inherited basis of human diseases and facilitate drug development. Gene-phenotype associations arise from complex biological processes and are influenced by various factors, such as the relationships among phenotypes and among genes. However, because curated gene-phenotype associations are sparse and the joint effect of multiple factors has not been analyzed in an integrated way, existing approaches are limited in both prediction accuracy and their ability to detect potential gene-phenotype associations. In this paper, we propose a novel method, Weighted Graph Constraint and Group Centric Non-negative Matrix Factorization (GC²NMF), which inherits the advantages of Non-negative Matrix Factorization (NMF) and exploits a weighted graph constraint learned from the hierarchical structure of phenotype data together with group prior information among genes. Specifically, we first introduce the depth of the parent-child relationship between two adjacent phenotypes in the hierarchical phenotype data as a weighted graph constraint, to better capture phenotype structure. Second, we use the intra-group correlation among genes in a gene group as a group constraint, to better capture gene structure; the intuition is that genes in the same group probably result in similar phenotypes. The model not only achieves high prediction performance but also learns interpretable representations of genes and phenotypes simultaneously, which facilitates future biological analysis. Experimental results on mouse and human gene-phenotype association datasets demonstrate that GC²NMF obtains superior prediction accuracy and better biological interpretability than other state-of-the-art methods.
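To make the idea of a weighted graph constraint on NMF concrete, the following is a minimal sketch of generic graph-regularized NMF with multiplicative updates. It is not the authors' implementation: the group constraint on genes is omitted, and the function names, the toy association matrix `X`, the phenotype edge weights in `W`, and the parameter `lam` are all assumptions made for illustration. The abstract does not specify how hierarchy depth is converted into edge weights, so the weights here are arbitrary.

```python
import numpy as np


def graph_regularized_nmf(X, W, rank, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Factorize X (genes x phenotypes) as U @ V.T with U, V >= 0.

    W is a symmetric, non-negative phenotype-phenotype weight matrix.
    The penalty lam * trace(V.T @ L @ V), with L = D - W, pulls the
    latent vectors of strongly linked phenotypes toward each other.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_phenotypes = X.shape
    U = rng.random((n_genes, rank))        # gene factors
    V = rng.random((n_phenotypes, rank))   # phenotype factors
    D = np.diag(W.sum(axis=1))             # degree matrix of the graph
    for _ in range(n_iter):
        # Standard multiplicative updates for graph-regularized NMF;
        # eps guards against division by zero.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V


def objective(X, W, U, V, lam=1.0):
    """Reconstruction error plus the graph smoothness penalty."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.norm(X - U @ V.T) ** 2 + lam * np.trace(V.T @ L @ V)


# Toy data (hypothetical): 6 genes x 4 phenotypes, 1 = known association.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

# Hypothetical phenotype graph: phenotypes 0-1 and 2-3 are parent-child
# pairs; the two weights stand in for a depth-derived weighting.
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 0.5

U, V = graph_regularized_nmf(X, W, rank=2, lam=1.0)
scores = U @ V.T  # predicted association scores, including unobserved pairs
```

Entries of `scores` at positions where `X` is zero can be ranked to propose candidate associations. Raising `lam` trades reconstruction accuracy for smoother phenotype factors along the graph.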