College of Maritime Economics and Management, Dalian Maritime University, Dalian, 116026, China.
Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian, 116026, China.
Interdiscip Sci. 2024 Dec;16(4):781-801. doi: 10.1007/s12539-024-00638-7. Epub 2024 Sep 4.
Genes that have been experimentally validated for diseases (or functions) can be used to develop machine learning methods that predict new disease/function genes. However, the prediction of both function genes and disease genes faces the same problem: only a limited set of positive examples is available, and there are no negative examples. To solve this problem, we propose VGAEMCD, a function/disease-gene prediction algorithm based on network embedding (Variational Graph Auto-Encoders, VGAE) and one-class classification (Fast Minimum Covariance Determinant, Fast-MCD). First, we construct a protein-protein interaction (PPI) network centered on experimentally validated genes; then VGAE is used to obtain embeddings of the nodes (genes) in the network; finally, the embeddings are fed into an improved deep learning one-class classifier based on Fast-MCD to predict function/disease genes. VGAEMCD predicts function genes and disease genes in a unified way and requires only experimentally verified genes as input (no expression profiles are needed). VGAEMCD outperforms classical one-class classification algorithms in Recall, Precision, F-measure, Specificity, and Accuracy. Further experiments show that seven metrics of VGAEMCD are higher than those of state-of-the-art function/disease-gene prediction algorithms. These results indicate that VGAEMCD learns the distribution of the positive examples well and accurately identifies function/disease genes.
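The one-class step described above can be illustrated with a minimal sketch. It is not the authors' improved deep learning classifier: it uses the plain Fast-MCD estimator from scikit-learn (`MinCovDet`), and synthetic Gaussian vectors stand in for VGAE embeddings. The helper names `fit_one_class_mcd` and `predict_positive`, the 4-dimensional embeddings, and the 0.95 quantile threshold are all assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import MinCovDet


def fit_one_class_mcd(pos_embeddings, quantile=0.95, seed=0):
    """Fit a robust Gaussian (Fast-MCD) to positive examples only.

    The acceptance threshold is an empirical quantile of the squared
    Mahalanobis distances of the training positives (an assumed choice).
    """
    mcd = MinCovDet(random_state=seed).fit(pos_embeddings)
    d2_train = mcd.mahalanobis(pos_embeddings)  # squared distances
    threshold = float(np.quantile(d2_train, quantile))
    return mcd, threshold


def predict_positive(mcd, threshold, embeddings):
    """Label a node as positive if it lies inside the robust ellipsoid."""
    return mcd.mahalanobis(embeddings) <= threshold


# Synthetic stand-ins for node embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
known_genes = rng.normal(size=(200, 4))       # validated (positive) genes
near_candidates = 0.1 * rng.normal(size=(5, 4))  # close to the positive cloud
far_candidates = 6.0 + rng.normal(size=(5, 4))   # far from the positive cloud

mcd, thr = fit_one_class_mcd(known_genes)
near_pred = predict_positive(mcd, thr, near_candidates)
far_pred = predict_positive(mcd, thr, far_candidates)

print("threshold (squared Mahalanobis):", round(thr, 2))
print("near candidates predicted positive:", near_pred)
print("far candidates predicted positive:", far_pred)
```

Because the model is fitted on positives only, no negative examples are needed: a candidate gene is accepted when its embedding falls within the region that covers most of the known positives. In a real pipeline the synthetic arrays would be replaced by embeddings learned from the PPI network.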