School of Computer Science, Qufu Normal University, Rizhao, China.
Department of Internet of Things Engineering, Wuxi Taihu University, Wuxi, China.
BMC Bioinformatics. 2021 Apr 1;22(1):175. doi: 10.1186/s12859-021-04104-9.
Identifying lncRNA-disease associations not only helps clarify the mechanisms underlying human diseases at the lncRNA level but also speeds up the identification of potential biomarkers for disease diagnosis, treatment, prognosis, and drug response prediction. However, as the volume of archived biological data continues to grow, detecting potential human lncRNA-disease associations in these large datasets with traditional experimental methods has become increasingly difficult. Consequently, developing new and effective computational methods to predict potential human lncRNA-disease associations is essential.
By combining incremental principal component analysis (IPCA) with a random forest (RF) classifier and integrating multiple similarity matrices, we propose a new algorithm (IPCARF), based on integrated machine learning technology, for predicting lncRNA-disease associations. First, we use two different models to compute a semantic similarity matrix of diseases from a directed acyclic graph of diseases. Second, we obtain a feature vector for each lncRNA-disease pair by integrating disease similarity, lncRNA similarity, and Gaussian kernel similarity. Third, we apply IPCA to reduce the dimension of the original feature set and obtain the best feature subspace. Finally, we train an RF model to predict potential lncRNA-disease associations. The experimental results show that IPCARF effectively improves the AUC when predicting potential lncRNA-disease associations. Before parameter optimization, IPCARF achieved an AUC of 0.8529 under 10-fold cross-validation; after the optimal parameters were selected by grid search, the AUC reached 0.8611.
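The steps above can be sketched in Python with scikit-learn. This is only an illustrative outline under stated assumptions, not the authors' implementation: the association matrix is random toy data, the Gaussian kernel bandwidth is normalised by the mean squared profile norm (an assumed convention), a Gaussian kernel stands in for the disease semantic similarity, and the helper names (`gaussian_kernel_similarity`, `pair_features`, `build_search`) and parameter grids are hypothetical.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline


def gaussian_kernel_similarity(profiles):
    """Gaussian kernel similarity between the rows of an association matrix."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Assumed bandwidth: inverse of the mean squared norm of the profiles.
    mean_sq = sq_norms.mean()
    gamma = 1.0 / mean_sq if mean_sq > 0 else 1.0
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.clip(sq_dists, 0.0, None))


def pair_features(lnc_sim, dis_sim, pairs):
    """Concatenate the lncRNA and disease similarity rows for each (i, j) pair."""
    return np.array([np.concatenate([lnc_sim[i], dis_sim[j]]) for i, j in pairs])


def build_search(n_components_grid=(5, 10), n_estimators_grid=(25, 50), seed=0):
    """IPCA for dimension reduction, then RF, tuned by grid search on 10-fold AUC."""
    pipe = Pipeline([
        ("ipca", IncrementalPCA()),
        ("rf", RandomForestClassifier(random_state=seed)),
    ])
    grid = {
        "ipca__n_components": list(n_components_grid),
        "rf__n_estimators": list(n_estimators_grid),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv)


# Toy data (hypothetical): 20 lncRNAs x 15 diseases with random associations.
rng = np.random.default_rng(0)
assoc = (rng.random((20, 15)) < 0.2).astype(int)

lnc_sim = gaussian_kernel_similarity(assoc)    # lncRNA-lncRNA similarity
dis_sim = gaussian_kernel_similarity(assoc.T)  # stand-in for disease similarity

pairs = [(i, j) for i in range(assoc.shape[0]) for j in range(assoc.shape[1])]
X = pair_features(lnc_sim, dis_sim, pairs)     # one 35-dimensional row per pair
y = assoc.ravel()                              # 1 = known association

search = build_search()
search.fit(X, y)
print("best parameters:", search.best_params_)
print("mean 10-fold AUC:", round(search.best_score_, 4))
```

Because the toy associations are random, the AUC printed here carries no meaning and is unrelated to the values reported above. The sketch also computes the similarities from the full association matrix, so a careful evaluation would recompute them inside each fold to avoid leaking test labels into the features.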
We compared IPCARF with the existing prediction methods LRLSLDA, LRLSLDA-LNCSIM, TPGLDA, NPCMF, and ncPred, which have shown excellent performance in predicting lncRNA-disease associations. The 10-fold cross-validation results show that IPCARF produces better predictions than the other compared methods.