IEEE/ACM Trans Comput Biol Bioinform. 2019 Jan-Feb;16(1):222-232. doi: 10.1109/TCBB.2017.2770120. Epub 2017 Nov 7.
Disease gene prediction is a challenging task with a variety of applications, such as early diagnosis and drug development. Existing machine learning methods suffer from sample imbalance because the number of known disease genes (positive samples) is much smaller than the number of unknown genes, which are typically treated as negative samples. In addition, most methods do not use clinical data from patients with a specific disease to predict disease genes. In this study, we propose a disease gene prediction algorithm, dgSeq, that combines a protein-protein interaction (PPI) network, clinical RNA-Seq data, and Online Mendelian Inheritance in Man (OMIM) data. dgSeq constructs differential networks based on rewiring information calculated from clinical RNA-Seq data. To select balanced sets of non-disease genes (negative samples), a disease-gene network is also constructed from OMIM data. After features are extracted from the PPI networks and differential networks, logistic regression classifiers are trained. dgSeq obtains AUC values of 0.88, 0.83, and 0.80 for identifying breast cancer genes, thyroid cancer genes, and Alzheimer's disease genes, respectively, outperforming three competing methods. Both gene set enrichment analysis and the predicted results demonstrate that dgSeq can effectively predict new disease genes.
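The general workflow the abstract describes (balance the negatives, train a logistic regression classifier, score it by AUC) can be illustrated with a minimal, standard-library-only sketch. This is not the dgSeq implementation. The abstract does not say how negatives are chosen from the disease-gene network, so `sample_balanced_negatives` uses plain random sampling as a stand-in, and all function names, the learning rate, and the epoch count are hypothetical choices made for illustration.

```python
import math
import random


def sample_balanced_negatives(all_genes, positives, seed=0):
    """Draw as many unlabeled genes as there are known disease genes.

    Assumption: uniform random sampling. The paper selects negatives with
    the help of an OMIM-derived disease-gene network instead.
    """
    rng = random.Random(seed)
    unlabeled = sorted(set(all_genes) - set(positives))
    return rng.sample(unlabeled, k=min(len(set(positives)), len(unlabeled)))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, X):
    """Predicted probability of the positive class for each feature row."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        probs = predict_proba(w, b, X)
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi, pi in zip(X, y, probs):
            err = pi - yi  # derivative of the log-loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In use, each gene would be represented by a feature vector (in the paper, features derived from the PPI and differential networks), the known disease genes plus the sampled negatives would form the training set, and `auc` would be computed on held-out genes. The pairwise AUC shown here is quadratic in the number of samples, which is acceptable for a sketch but not for large gene sets.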