School of Software and Microelectronics, Harbin University of Science and Technology, Harbin, 150080, China.
College of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, 150050, China.
BMC Bioinformatics. 2020 Mar 27;21(1):126. doi: 10.1186/s12859-020-3458-1.
Accumulating evidence shows that abnormal regulation of long non-coding RNAs (lncRNAs) is associated with various human diseases. Accurately identifying disease-associated lncRNAs helps to clarify the mechanisms by which lncRNAs act in disease and to explore new therapies. Many lncRNA-disease association (LDA) prediction models have been built by integrating multiple kinds of data resources. However, most existing models ignore the interference of noisy and redundant information in these data resources.
To improve LDA prediction, we implemented RFLDA, a prediction model based on random forest and feature selection. First, RFLDA integrates experimentally supported miRNA-disease associations (MDAs) and LDAs, disease semantic similarity (DSS), lncRNA functional similarity (LFS), and lncRNA-miRNA interactions (LMI) as input features. Then, RFLDA selects the most useful features for training by ranking them with the random forest variable importance score, which accounts not only for the effect of each individual feature on the prediction results but also for the joint effects of multiple features. Finally, a random forest regression model is trained to score potential lncRNA-disease associations. Under 5-fold cross-validation, RFLDA achieves an area under the receiver operating characteristic curve (AUC) of 0.976 and an area under the precision-recall curve (AUPR) of 0.779, outperforming several state-of-the-art LDA prediction models. Moreover, in case studies on three cancers, 43 of the 45 lncRNAs predicted by RFLDA are validated by experimental data, and the other two are supported by other LDA prediction models.
Cross-validation and case studies indicate that RFLDA is highly effective at identifying potential disease-associated lncRNAs.
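The feature-selection and scoring steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses the impurity-based `feature_importances_` as a stand-in for the variable importance score (the abstract does not say which variant is used), and runs on synthetic data in place of the real MDA, LDA, DSS, LFS, and LMI features. The helper names `select_top_features` and `cross_validate` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold


def select_top_features(X, y, k, seed=0):
    """Rank features by random forest importance and keep the k highest."""
    forest = RandomForestRegressor(n_estimators=100, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return np.sort(order[:k])


def cross_validate(X, y, k, n_splits=5, seed=0):
    """Return the mean AUC and mean average precision over stratified folds."""
    aucs, auprs = [], []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        # Select features on the training fold only, to avoid leakage.
        cols = select_top_features(X[train_idx], y[train_idx], k, seed)
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X[train_idx][:, cols], y[train_idx])
        # The regression output is used as an association score.
        scores = model.predict(X[test_idx][:, cols])
        aucs.append(roc_auc_score(y[test_idx], scores))
        # Average precision is a common estimate of the area under the PR curve.
        auprs.append(average_precision_score(y[test_idx], scores))
    return float(np.mean(aucs)), float(np.mean(auprs))


# Synthetic stand-in data (assumption): 300 lncRNA-disease pairs, 20 features,
# with the label depending only on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

selected = select_top_features(X, y, k=5)
mean_auc, mean_aupr = cross_validate(X, y, k=5)
print("selected feature indices:", selected)
print("mean AUC:", round(mean_auc, 3), "mean AUPR:", round(mean_aupr, 3))
```

Selecting features inside each fold, rather than once on the full data set, is a design choice made here so that the held-out pairs do not influence which features are kept; the abstract does not state how the original evaluation handled this.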