Pan Yiru, Ji Xingyu, You Jiaqi, Li Lu, Liu Zhenping, Zhang Xianlong, Zhang Zeyu, Wang Maojun
National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, 430070 Hubei, China.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbaf062.
Predicting positive and negative associations between genes and phenotypes helps to illustrate the underlying mechanisms of complex traits in organisms. The transcriptional and regulatory activity of specific genes varies across cell types, developmental timepoints, and physiological states. Two problems arise in obtaining the positive/negative associations between genes and phenotypes: (1) high-throughput DNA/RNA sequencing and phenotyping are expensive and time-consuming because large sample sizes must be processed; (2) experiments introduce both random and systematic errors, while calculations or predictions made with software or models may add further noise. To address these two issues, we propose a Contrastive Signed Graph Diffusion Network, CSGDN, which learns robust node representations from fewer training samples to achieve higher link prediction accuracy. CSGDN uses a signed graph diffusion method to uncover the underlying regulatory associations between genes and phenotypes. Then, stochastic perturbation strategies are used to create two views for both the original and the diffusive graphs. Lastly, a multiview contrastive learning loss is designed to unify the node representations learned from the two views, so as to resist interference and reduce noise. We perform experiments to validate the performance of CSGDN on three crop datasets: Gossypium hirsutum, Brassica napus, and Triticum turgidum. The results show that the proposed model outperforms state-of-the-art methods by up to 9.28% AUC for the prediction of link sign on the G. hirsutum dataset. The source code of our model is available at https://github.com/Erican-Ji/CSGDN.
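The three steps named in the abstract (signed diffusion, stochastic perturbation, contrastive alignment of two views) can be illustrated with a minimal NumPy sketch. It is not the CSGDN implementation, which is available at the repository above. Every function name, the PageRank-style truncated diffusion, the edge-dropping perturbation, and the InfoNCE-style loss are assumptions chosen only to make the ideas concrete; the abstract does not specify which diffusion kernel, perturbation, or loss the authors use.

```python
import numpy as np

def signed_diffusion(A, alpha=0.2, steps=2):
    """Truncated PageRank-style diffusion on a signed adjacency matrix.

    Assumption: rows are normalized by absolute degree, so edge signs
    multiply along a path (two negative hops give a positive score).
    """
    A = np.asarray(A, dtype=float)
    deg = np.abs(A).sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # avoid division by zero
    T = A / deg                              # signed transition matrix
    S = np.zeros_like(A)
    P = np.eye(A.shape[0])                   # T to the power 0
    for k in range(steps + 1):
        S += alpha * (1 - alpha) ** k * P    # weight of k-hop paths
        P = P @ T
    return S

def drop_edges(A, p, rng):
    """Create a perturbed view by removing each undirected edge with probability p."""
    keep = np.triu(rng.random(A.shape) >= p, 1)  # decide once per node pair
    keep = keep | keep.T                         # keep the graph symmetric
    return A * keep

def info_nce(Z1, Z2, tau=0.5):
    """InfoNCE-style loss: row i of Z1 should match row i of Z2."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    sim = Z1 @ Z2.T / tau
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy signed graph (hypothetical data): edges 0-1 and 1-2 are both negative.
A = np.array([[0, -1, 0],
              [-1, 0, -1],
              [0, -1, 0]], dtype=float)
S = signed_diffusion(A)

rng = np.random.default_rng(0)
view_a = drop_edges(A, p=0.3, rng=rng)
view_b = drop_edges(S, p=0.3, rng=rng)

Z = rng.normal(size=(5, 8))                      # stand-in node embeddings
loss_aligned = info_nce(Z, Z)
loss_shuffled = info_nce(Z, Z[::-1])
```

In this toy graph the diffusion assigns nodes 0 and 2 a positive score even though they share no direct edge, because the two-hop path between them crosses two negative links. The loss is lower when the two views agree row by row than when the rows of one view are reordered, which is the property a contrastive objective relies on.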