School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China.
School of Information Systems, Singapore Management University, 178902 Singapore.
Bioinformatics. 2020 Aug 15;36(16):4458-4465. doi: 10.1093/bioinformatics/btaa211.
Synthetic lethality (SL) is a promising type of gene interaction for cancer therapy because it can identify specific genes to target in cancer cells without disrupting normal cells. Because high-throughput wet-lab settings are often costly and face various challenges, computational approaches have become a practical complement. In particular, predicting SLs can be formulated as a link prediction task on a graph of interacting genes. Although matrix factorization techniques have been widely adopted for link prediction, they map each gene to a latent representation in isolation, without aggregating information from neighboring genes. Graph convolutional networks (GCN) can capture such neighborhood dependency in a graph. However, applying GCN to SL prediction remains challenging because SL interactions are extremely sparse, which makes overfitting more likely.
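To make the link prediction formulation concrete, the following is a minimal NumPy sketch of one graph-convolution step followed by an inner-product decoder. It is an illustration under stated assumptions, not the authors' implementation: the symmetric normalization, the ReLU activation, the sigmoid inner-product scoring, the toy 4-gene graph, and all function names are choices made here for the example.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, a common GCN normalization (assumed here)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # degrees are >= 1 thanks to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, features, weight):
    """Aggregate neighbor features, apply a linear map, then ReLU."""
    return np.maximum(a_norm @ features @ weight, 0.0)

def score_links(embeddings):
    """Score every gene pair with a sigmoid over the embedding inner product."""
    logits = embeddings @ embeddings.T
    return 1.0 / (1.0 + np.exp(-logits))

# Toy graph of 4 hypothetical genes connected in a chain: 0-1, 1-2, 2-3.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
features = np.eye(4)                        # one-hot input features (assumption)
weight = rng.normal(size=(4, 2))            # untrained weights, for shape only

a_norm = normalize_adjacency(adj)
embeddings = gcn_layer(a_norm, features, weight)
scores = score_links(embeddings)            # scores[i, j]: predicted SL score
```

Unlike a plain matrix factorization, each row of `embeddings` here depends on the features of the gene's neighbors through `a_norm`, which is the neighborhood dependency the abstract refers to.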
In this article, we propose a novel dual-dropout GCN (DDGCN) for learning more robust gene representations for SL prediction. We employ both coarse-grained node dropout and fine-grained edge dropout to address the issue that standard dropout in vanilla GCN is often inadequate in reducing overfitting on sparse graphs. In particular, coarse-grained node dropout can efficiently and systematically enforce dropout at the node (gene) level, while fine-grained edge dropout can further fine-tune the dropout at the interaction (edge) level. We further present a theoretical framework to justify our model architecture. Finally, we conduct extensive experiments on human SL datasets and the results demonstrate the superior performance of our model in comparison with state-of-the-art methods.
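The two dropout granularities can be sketched on an adjacency matrix as follows. This is a hedged illustration of the general idea only: the abstract does not specify how DDGCN samples or combines the masks, so the Bernoulli masks, the order of application, the rates, and the function names below are assumptions made for this example.

```python
import numpy as np

def node_dropout(adj, rate, rng):
    """Coarse-grained: drop whole nodes, removing every edge that touches them."""
    keep = rng.random(adj.shape[0]) >= rate           # one keep/drop draw per gene
    return adj * keep[:, None] * keep[None, :]

def edge_dropout(adj, rate, rng):
    """Fine-grained: drop individual undirected edges independently."""
    upper = np.triu(adj, k=1)                         # each undirected edge once
    keep = rng.random(upper.shape) >= rate            # one keep/drop draw per pair
    kept = upper * keep
    return kept + kept.T                              # restore symmetry

def dual_dropout(adj, node_rate, edge_rate, rng):
    """Apply node dropout first, then edge dropout on what remains (assumed order)."""
    return edge_dropout(node_dropout(adj, node_rate, rng), edge_rate, rng)

# Toy symmetric SL graph over 5 hypothetical genes, with no self-loops.
rng = np.random.default_rng(42)
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 1],
                [1, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 1, 0, 1, 0]], dtype=float)

dropped = dual_dropout(adj, node_rate=0.2, edge_rate=0.3, rng=rng)
```

In a training loop, a fresh `dropped` matrix would typically be sampled at each step and used in place of `adj` during propagation, with the full graph used at inference time; that usage pattern is also an assumption here rather than something stated in the abstract.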
DDGCN is implemented in Python 3.7, open-source and freely available at https://github.com/CXX1113/Dual-DropoutGCN.
Supplementary data are available at Bioinformatics online.