Department of Information Systems and Analytics, School of Computing, National University of Singapore, Singapore 117417, Singapore.
Bioinformatics. 2022 Sep 30;38(19):4554-4561. doi: 10.1093/bioinformatics/btac543.
Many biomedical studies need to integrate data from multiple directly or indirectly related sources. Collective matrix factorization (CMF) and its variants are models designed to learn collectively from arbitrary collections of matrices. The learned latent factors are rich integrative representations that can be used in downstream tasks, such as clustering or relation prediction, with standard machine-learning models. Previous CMF-based methods have several modeling limitations: they do not adequately capture complex non-linear interactions, they do not explicitly model varying sparsity and noise levels in the inputs, and some cannot model inputs with multiple datatypes. These inadequacies limit their use on many biomedical datasets.
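To make the idea of collective factorization concrete, the following is a minimal sketch of classical linear CMF, not of NCMF and not the authors' implementation. It assumes two dense, fully observed matrices that share their row entity (for example, genes), a squared-error loss, and plain gradient descent. The function name `cmf` and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def cmf(X, Y, k=2, lr=0.01, reg=0.01, steps=5000, seed=0):
    """Illustrative linear CMF: fit X ~ U V^T and Y ~ U W^T with shared U.

    X has shape (n, m) and Y has shape (n, p); both describe the same n row
    entities, so a single factor matrix U is learned from both of them.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    p = Y.shape[1]
    # Small random initialization of the latent factors.
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    W = 0.1 * rng.standard_normal((p, k))
    for _ in range(steps):
        err_x = U @ V.T - X  # reconstruction residual of X
        err_y = U @ W.T - Y  # reconstruction residual of Y
        # Gradients of 0.5 * (||err_x||^2 + ||err_y||^2) plus an L2 penalty.
        grad_u = err_x @ V + err_y @ W + reg * U  # U receives signal from both
        grad_v = err_x.T @ U + reg * V
        grad_w = err_y.T @ U + reg * W
        U -= lr * grad_u
        V -= lr * grad_v
        W -= lr * grad_w
    return U, V, W


if __name__ == "__main__":
    # Synthetic rank-2 data in which both matrices share the same row factors.
    rng = np.random.default_rng(1)
    true_u = rng.standard_normal((6, 2))
    X = true_u @ rng.standard_normal((5, 2)).T
    Y = true_u @ rng.standard_normal((4, 2)).T
    U, V, W = cmf(X, Y)
    print("relative error on X:", np.sum((U @ V.T - X) ** 2) / np.sum(X ** 2))
    print("relative error on Y:", np.sum((U @ W.T - Y) ** 2) / np.sum(Y ** 2))
```

The rows of `U` serve as the integrative representations of the shared entity and could be passed to a standard clustering or classification model. The limitations listed above are visible in this sketch: the interaction between factors is a plain inner product, every entry is weighted equally regardless of sparsity or noise, and a single loss is applied to both matrices whatever their datatype.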
To address these limitations, we develop Neural Collective Matrix Factorization (NCMF), the first fully neural approach to CMF. We evaluate NCMF on two relation prediction tasks, gene-disease association prediction and adverse drug event prediction, using multiple datasets. In each case, data are obtained from heterogeneous publicly available databases and used to learn representations for building predictive models. In our experiments, NCMF outperforms previous CMF-based methods and several state-of-the-art graph embedding methods for representation learning. These experiments illustrate the versatility and efficacy of NCMF in representation learning for seamless integration of heterogeneous data.
Availability and implementation: https://github.com/ajayago/NCMF_bioinformatics
Supplementary data are available at Bioinformatics online.