Yao Dengju, Zhang Tao, Zhan Xiaojuan, Zhang Shuli, Zhan Xiaorong, Zhang Chao
School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, China.
College of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China.
Front Genet. 2022 Aug 24;13:995532. doi: 10.3389/fgene.2022.995532. eCollection 2022.
Growing evidence shows that abnormal expression of long non-coding RNAs (lncRNAs) is associated with a variety of human diseases. Accurate identification of disease-related lncRNAs can therefore help to understand lncRNA expression at the molecular level and to explore more effective treatments for diseases. Many lncRNA-disease association prediction models have been proposed, but recognizing unknown lncRNA-disease associations remains a challenge. In this work, we propose a computational model for predicting lncRNA-disease associations based on geometric complement heterogeneous information and random forest. First, geometric complement heterogeneous information is used to integrate lncRNA-miRNA interactions and experimentally verified miRNA-disease associations. Second, lncRNA and disease features, each consisting of the respective similarity coefficients, are fused into the input feature space. Third, an autoencoder projects the raw high-dimensional features into a low-dimensional space to learn representations of lncRNAs and diseases. Finally, the low-dimensional lncRNA and disease features are fused into the input feature space to train a random forest classifier for lncRNA-disease association prediction. Under five-fold cross-validation, the model achieves an AUC (area under the receiver operating characteristic curve) of 0.9897 and an AUPR (area under the precision-recall curve) of 0.7040, indicating that it outperforms several state-of-the-art lncRNA-disease association prediction models. In addition, case studies on colon and stomach cancer indicate that the model has a good ability to predict disease-related lncRNAs.
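The two reported metrics can differ widely on imbalanced data, where known associations are far fewer than unlabeled pairs. The following is a minimal sketch of how AUC and AUPR might be computed from classifier scores; it is not the authors' evaluation code. The function names `roc_auc` and `average_precision` and the toy labels and scores are hypothetical, and AUPR is approximated here by average precision, which is only one common estimator and may not match the one used in the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the rank formulation: the probability that a random positive
    is scored above a random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, used here as an estimate of AUPR (an assumption).
    Tied scores keep their input order because the sort is stable."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy data: 2 known associations among 10 candidate pairs.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

auc = roc_auc(toy_labels, toy_scores)              # 15/16 = 0.9375
aupr = average_precision(toy_labels, toy_scores)   # (1 + 2/3) / 2 = 0.8333...
print(f"AUC={auc:.4f}  AUPR={aupr:.4f}")
```

In this toy case a single negative ranked above one positive lowers the AUC only slightly but cuts precision at that recall step to 2/3, which illustrates why AUPR is typically the stricter of the two metrics when positives are rare.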