School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Daxue Road 3501, Jinan, 250353, Shandong, China.
School of Computer and Software, Nanyang Institute of Technology, Changjiang Road 80, Nanyang, 473004, Henan, China.
BMC Bioinformatics. 2021 Jun 16;22(1):332. doi: 10.1186/s12859-021-04256-8.
Long non-coding RNAs (lncRNAs) are non-coding RNA molecules with transcripts longer than 200 nucleotides. LncRNAs have emerged as novel candidate biomarkers in cancer diagnosis and prognosis. However, it is difficult to discover the true association mechanisms between lncRNAs and complex diseases. The unprecedented enrichment of multi-omics data and the rapid development of machine learning technology provide an opportunity to design a machine learning framework for studying the relationship between lncRNAs and complex diseases.
In this article, we propose a new machine learning approach, LGDLDA (LncRNA-Gene-Disease association networks based LncRNA-Disease Association prediction), for predicting disease-related lncRNAs based on multi-omics data, machine learning methods and neural network neighborhood information aggregation. First, LGDLDA calculates similarity matrices for lncRNAs, genes and diseases. LncRNA similarity is calculated from the lncRNA expression profile matrix, the lncRNA-miRNA interaction matrix and the lncRNA-protein interaction matrix. Gene similarity is obtained from the lncRNA-gene association matrix and the gene-disease association matrix. Disease similarity is obtained from the disease ontology, the disease-miRNA association matrix and Gaussian interaction profile kernel similarity. Second, LGDLDA integrates the neighborhood information in the similarity matrices using the nonlinear feature learning of a neural network. Third, LGDLDA uses the embedded node representations to approximate the observed matrices. Finally, LGDLDA ranks candidate lncRNA-disease pairs and selects potential disease-related lncRNAs.
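Of the similarity measures named above, the Gaussian interaction profile (GIP) kernel has a compact, widely used definition, so a small sketch may help. The abstract does not give the exact formula or bandwidth used by LGDLDA. The code below therefore assumes the common formulation, in which each entity's interaction profile is its row in a binary association matrix and the bandwidth is normalized by the mean squared profile norm. The function name `gip_kernel_similarity`, the parameter `gamma_prime` and the toy matrix are illustrative assumptions, not part of the published method.

```python
import numpy as np


def gip_kernel_similarity(assoc, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity between the rows of `assoc`.

    Assumed formulation (not taken from the abstract):
        K[i, j] = exp(-gamma * ||assoc[i] - assoc[j]||^2)
        gamma   = gamma_prime / mean_i(||assoc[i]||^2)
    """
    assoc = np.asarray(assoc, dtype=float)

    # Squared Euclidean norm of each interaction profile (row).
    sq_norms = np.sum(assoc ** 2, axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("association matrix has no known associations")

    # Bandwidth normalized by the average profile norm.
    gamma = gamma_prime / mean_sq_norm

    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b,
    # clipped at zero to guard against small negative rounding errors.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (assoc @ assoc.T)
    sq_dists = np.maximum(sq_dists, 0.0)

    return np.exp(-gamma * sq_dists)


# Hypothetical toy data: 3 diseases (rows) x 4 miRNAs (columns).
toy_assoc = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
])
sim = gip_kernel_similarity(toy_assoc)
```

In this toy example the first two diseases share an identical profile and so receive similarity 1, while the third shares no associations with them and receives a much lower score. In practice such a kernel would be combined with the other similarity sources listed above. How LGDLDA fuses them is not specified in the abstract.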
Compared with existing lncRNA-disease prediction methods, our proposed method takes more critical information into account and improves the performance of cancer-related lncRNA prediction. Experiments on randomly split data show that LGDLDA is more stable than IDHI-MIRW, NCPLDA, LncDisAP and NCPHLDA. Results on different simulation data sets show that LGDLDA can accurately and effectively predict disease-related lncRNAs. Furthermore, we applied the method to three real cancer data sets (gastric cancer, colorectal cancer and breast cancer) to predict potential cancer-related lncRNAs.