College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China.
Comput Biol Chem. 2020 Apr;85:107200. doi: 10.1016/j.compbiolchem.2020.107200. Epub 2020 Jan 28.
MicroRNAs (miRNAs) have been shown to play an indispensable role in many fundamental biological processes, and the dysregulation of miRNAs is closely correlated with complex human diseases. Many studies have focused on predicting potential miRNA-disease associations. Given the small number of known miRNA-disease associations and the poor performance of many existing prediction methods, a novel model combining a gradient boosting decision tree with logistic regression (GBDT-LR) is proposed to prioritize miRNA candidates for diseases. To balance positive and negative samples, GBDT-LR first applies k-means clustering to screen negative samples from the unknown miRNA-disease associations. Then the gradient boosting decision tree (GBDT) model, which has an intrinsic advantage in finding distinguishing features and feature combinations, is applied to extract features. Finally, the new features extracted by the GBDT model are fed into a logistic regression (LR) model to predict the final miRNA-disease association score. The experimental results show that GBDT-LR reaches an average AUC of 0.9274 in 5-fold cross-validation (CV). In addition, in the case studies, 90%, 94%, and 88% of the top 50 miRNAs potentially associated with colon cancer, gastric cancer, and pancreatic cancer, respectively, were confirmed by databases. Compared with three other state-of-the-art methods, GBDT-LR achieves the best prediction performance. The source code and dataset of GBDT-LR are freely available at https://github.com/Pualalala/GBDT-LR.
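The three steps described in the abstract (k-means screening of negatives, GBDT feature extraction, LR scoring) can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the feature vectors are synthetic, and the function names, the number of clusters, the equal-per-cluster sampling rule, and all hyperparameters are assumptions, since the abstract does not specify them. The GBDT features are taken here to be one-hot encoded leaf indices, which is a common way to combine the two models and is likewise an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


def screen_negatives(unlabeled, n_needed, n_clusters=5, seed=0):
    """Draw roughly equal numbers of unlabeled samples from each k-means cluster.

    Assumed sampling rule: n_needed // n_clusters samples per cluster.
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(unlabeled)
    per_cluster = n_needed // n_clusters
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, members.size)
        chosen.extend(rng.choice(members, size=take, replace=False))
    return unlabeled[np.array(chosen, dtype=int)]


def fit_gbdt_lr(X, y, seed=0):
    """Fit a GBDT, one-hot encode its leaf indices, and fit LR on the encoding."""
    gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=seed)
    gbdt.fit(X, y)
    # apply() returns (n_samples, n_estimators, 1) for binary classification.
    leaves = gbdt.apply(X)[:, :, 0]
    encoder = OneHotEncoder(handle_unknown="ignore")
    lr = LogisticRegression(max_iter=1000)
    lr.fit(encoder.fit_transform(leaves), y)
    return gbdt, encoder, lr


def predict_scores(model, X):
    """Return the LR probability of the positive class for each sample."""
    gbdt, encoder, lr = model
    leaves = gbdt.apply(X)[:, :, 0]
    return lr.predict_proba(encoder.transform(leaves))[:, 1]


def run_demo(seed=0):
    """Run the pipeline on synthetic data (placeholder for real miRNA-disease features)."""
    rng = np.random.default_rng(seed)
    positives = rng.normal(1.0, 1.0, size=(200, 8))   # stand-in for known associations
    unlabeled = rng.normal(0.0, 1.0, size=(2000, 8))  # stand-in for unknown associations
    negatives = screen_negatives(unlabeled, n_needed=len(positives), seed=seed)
    X = np.vstack([positives, negatives])
    y = np.concatenate([np.ones(len(positives), dtype=int),
                        np.zeros(len(negatives), dtype=int)])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    model = fit_gbdt_lr(X_tr, y_tr, seed=seed)
    scores = predict_scores(model, X_te)
    return negatives, scores, roc_auc_score(y_te, scores)


if __name__ == "__main__":
    _, _, auc = run_demo()
    print(f"held-out AUC on synthetic data: {auc:.3f}")
```

In this sketch the LR model is fitted on the same training samples as the trees, which keeps the code short but can overfit; fitting the two stages on separate subsets is a common alternative. The AUC printed here reflects only the synthetic data and is unrelated to the 0.9274 reported above.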