College of Computer and Information Science, Southwest University, Chongqing 400715, China.
Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
Bioinformatics. 2018 May 1;34(9):1529-1537. doi: 10.1093/bioinformatics/btx794.
Long non-coding RNAs (lncRNAs) play crucial roles in the diagnosis, prognosis, prevention and treatment of complex diseases, but only a small portion of lncRNA-disease associations have been experimentally verified. Various computational models have been proposed to identify lncRNA-disease associations by integrating heterogeneous data sources. However, existing models generally either ignore the intrinsic structure of the data sources or treat all sources as equally relevant, which they may not be.
To accurately identify lncRNA-disease associations, we propose a Matrix Factorization based LncRNA-Disease Association prediction model (MFLDA for short). MFLDA decomposes the data matrices of heterogeneous data sources into low-rank matrices via matrix tri-factorization to explore and exploit their intrinsic and shared structure. MFLDA selects and integrates the data sources by assigning different weights to them. An iterative solution is introduced to optimize the weights and the low-rank matrices simultaneously. MFLDA then uses the optimized low-rank matrices to reconstruct the lncRNA-disease association matrix and thereby identify potential associations. In 5-fold cross-validation experiments on identifying verified lncRNA-disease associations, MFLDA achieves an area under the receiver operating characteristic curve (AUC) of 0.7408, at least 3% higher than those of state-of-the-art data fusion based computational models. An empirical study on identifying masked lncRNA-disease associations again shows that MFLDA identifies potential associations more accurately than competing models. A case study on identifying lncRNAs associated with breast, lung and stomach cancers shows that 38 out of 45 (84%) associations predicted by MFLDA are supported by recent biomedical literature, which further demonstrates the capability of MFLDA in identifying novel lncRNA-disease associations. MFLDA is a general data fusion framework, and as such it can be adopted to predict associations between other biological entities.
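The idea of weighted matrix tri-factorization can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes every data source is a relation matrix whose rows are lncRNAs, models each source as R_k ≈ G0 S_k G_k^T with a shared lncRNA factor G0, drops any non-negativity constraints, uses alternating least squares for the factors, and assumes an L2-regularized weight objective solved by projection onto the probability simplex. All names (`fit_weighted_trifactorization`, `project_to_simplex`, `alpha`, `rank`) and the synthetic data are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_weighted_trifactorization(relations, rank=3, alpha=1.0, n_iter=30, seed=0):
    """Fit R_k ~ G0 @ S_k @ G_k.T for every source, plus one weight per source.

    relations: list of (n_lncRNA x n_k) arrays that share the lncRNA rows.
    Assumed objective: sum_k w_k * ||R_k - G0 S_k G_k^T||_F^2 + alpha * ||w||^2,
    with w constrained to the probability simplex.
    """
    rng = np.random.default_rng(seed)
    n = relations[0].shape[0]
    G0 = rng.standard_normal((n, rank))
    Gs = [rng.standard_normal((R.shape[1], rank)) for R in relations]
    Ss = [np.eye(rank) for _ in relations]
    w = np.full(len(relations), 1.0 / len(relations))
    history = []
    for _ in range(n_iter):
        # Shared lncRNA factor: one weighted least-squares fit over all sources.
        B = np.hstack([np.sqrt(wk) * (S @ G.T) for wk, S, G in zip(w, Ss, Gs)])
        Y = np.hstack([np.sqrt(wk) * R for wk, R in zip(w, relations)])
        G0 = np.linalg.lstsq(B.T, Y.T, rcond=None)[0].T
        for k, R in enumerate(relations):
            # Source-specific factor, then the small core matrix S_k.
            Gs[k] = np.linalg.lstsq(G0 @ Ss[k], R, rcond=None)[0].T
            Ss[k] = np.linalg.pinv(G0) @ R @ np.linalg.pinv(Gs[k]).T
        losses = np.array([np.linalg.norm(R - G0 @ S @ G.T) ** 2
                           for R, S, G in zip(relations, Ss, Gs)])
        # Weight step: minimizing w.losses + alpha*||w||^2 on the simplex
        # equals projecting -losses / (2*alpha) onto the simplex.
        w = project_to_simplex(-losses / (2.0 * alpha))
        history.append(float(w @ losses + alpha * (w @ w)))
    return G0, Ss, Gs, w, history

# Synthetic demo: two sources share a low-rank lncRNA structure, one is pure noise.
rng = np.random.default_rng(1)
true_G0 = rng.standard_normal((20, 3))
lnc_disease = true_G0 @ rng.standard_normal((3, 10)) + 0.01 * rng.standard_normal((20, 10))
lnc_gene = true_G0 @ rng.standard_normal((3, 15)) + 0.01 * rng.standard_normal((20, 15))
lnc_noise = rng.standard_normal((20, 12))
# Scale each source to unit Frobenius norm so the losses are comparable.
relations = [R / np.linalg.norm(R) for R in (lnc_disease, lnc_gene, lnc_noise)]

G0, Ss, Gs, w, history = fit_weighted_trifactorization(relations)
# Reconstructed lncRNA-disease matrix: larger entries are stronger candidates.
scores = G0 @ Ss[0] @ Gs[0].T
```

Ranking the unobserved entries of `scores` gives candidate associations in this sketch, and `w` shows how much each source contributes. Because every step exactly minimizes the assumed objective over one block of variables, the values in `history` should not increase from one iteration to the next.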
The source code for MFLDA is available at: http://mlda.swu.edu.cn/codes.php?name=MFLDA.
Supplementary data are available at Bioinformatics online.