Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA.
Bioinformatics. 2014 Jun 15;30(12):i60-68. doi: 10.1093/bioinformatics/btu269.
Most existing methods for predicting causal disease genes rely on a specific type of evidence and are therefore limited in applicability. In practice, the type of evidence available varies from disease to disease: we may know linked genes, keywords associated with the disease obtained by text mining, or the co-occurrence of disease symptoms in patients. The evidence available for genes varies similarly; for example, specific microarray probes convey information only for certain sets of genes. In this article, we apply a novel matrix-completion method called Inductive Matrix Completion to the problem of predicting gene-disease associations. It combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene-disease associations. We construct features from different biological sources, such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive: it can be applied to diseases not seen at training time, unlike traditional matrix-completion approaches and network-based inference methods, which are transductive.
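To make the inductive idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the usual bilinear form of Inductive Matrix Completion, in which the association matrix P (genes x diseases) is approximated by X W H^T Y^T, where X and Y hold gene and disease features and W, H are low-rank factors. The optimizer (plain gradient descent on a regularized squared loss over all entries), the synthetic data, and the names fit_imc, objective and score_disease are illustrative assumptions that do not come from the abstract.

```python
import numpy as np


def objective(P, X, Y, W, H, lam):
    """0.5*||X W H^T Y^T - P||_F^2 + 0.5*lam*(||W||_F^2 + ||H||_F^2)."""
    R = X @ W @ H.T @ Y.T - P
    return 0.5 * np.sum(R ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(H ** 2))


def fit_imc(P, X, Y, rank=2, lam=0.1, lr=0.005, steps=2000, seed=0):
    """Fit P ~ X W H^T Y^T by gradient descent (assumed solver, for illustration)."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], rank))  # gene-feature factor
    H = 0.1 * rng.standard_normal((Y.shape[1], rank))  # disease-feature factor
    for _ in range(steps):
        R = X @ W @ H.T @ Y.T - P            # residual, genes x diseases
        grad_W = X.T @ R @ Y @ H + lam * W   # gradient of the objective w.r.t. W
        grad_H = Y.T @ R.T @ X @ W + lam * H  # gradient of the objective w.r.t. H
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H


def score_disease(X, W, H, y_new):
    """Score every gene for a disease given only its feature vector y_new."""
    return X @ W @ (H.T @ y_new)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_genes, n_diseases, f_g, f_d = 30, 20, 5, 4
    # Synthetic features, scaled so that X^T X and Y^T Y are roughly identity.
    X = rng.standard_normal((n_genes, f_g)) / np.sqrt(n_genes)
    Y = rng.standard_normal((n_diseases, f_d)) / np.sqrt(n_diseases)
    # Synthetic real-valued associations with planted low-rank structure.
    P = X @ rng.standard_normal((f_g, 2)) @ rng.standard_normal((2, f_d)) @ Y.T
    W, H = fit_imc(P, X, Y)
    # A disease that was not in the training matrix: only its features are known.
    y_new = rng.standard_normal(f_d) / np.sqrt(n_diseases)
    top5 = np.argsort(-score_disease(X, W, H, y_new))[:5]
    print("top-ranked gene indices for the unseen disease:", top5)
```

Because the learned factors act on features rather than on row or column indices, score_disease needs no column of P for the new disease. A transductive matrix-completion model, which learns one latent vector per observed disease, has nothing to look up in that situation.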
Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better: it has close to a one-in-four chance of recovering a true association in the top 100 predictions, whereas the recently proposed Catapult method (second best) has a <15% chance. We demonstrate that the inductive method is particularly effective for a query disease with no previously known gene associations, and for predicting novel genes, i.e. genes not previously linked to diseases. Thus, the method is capable of predicting novel genes even for well-characterized diseases. We also validate the novelty of predictions by evaluating the method on recently reported OMIM associations and on associations recently reported in the literature.
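The "chance of recovering a true association in the top 100 predictions" can be read as a hit rate over held-out associations. The sketch below shows one way such a number could be computed. The abstract does not describe the exact evaluation protocol, so the hold-out scheme, the exclusion of already-known genes from the ranking, and the function names are all assumptions made for illustration.

```python
def recovered_in_top_k(scores, held_out_gene, known_genes=(), k=100):
    """True if the held-out gene ranks in the top k for its disease.

    scores maps gene -> predicted score for one disease. Genes already known
    for that disease (other than the held-out one) are dropped from the ranking.
    """
    known = set(known_genes)
    candidates = [g for g in scores if g == held_out_gene or g not in known]
    ranked = sorted(candidates, key=lambda g: scores[g], reverse=True)
    return held_out_gene in ranked[:k]


def recovery_rate(trials, k=100):
    """Fraction of (scores, held_out_gene, known_genes) trials recovered in the top k."""
    hits = sum(recovered_in_top_k(s, g, known, k) for s, g, known in trials)
    return hits / len(trials)


if __name__ == "__main__":
    # Toy scores for a single disease; gene names are placeholders.
    toy = {"geneA": 0.9, "geneB": 0.5, "geneC": 0.1}
    print(recovery_rate([(toy, "geneB", ()), (toy, "geneA", ())], k=1))
```

Under this reading, "close to a one-in-four chance" corresponds to a recovery rate near 0.25 at k = 100.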
Source code and datasets can be downloaded from http://bigdata.ices.utexas.edu/project/gene-disease.