School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK.
School of Natural and Computing Science, University of Aberdeen, King's College, Aberdeen AB24 3FX, UK.
Brief Bioinform. 2024 Jul 25;25(5). doi: 10.1093/bib/bbae380.
Understanding the genetic basis of disease is a fundamental aspect of medical research, as genes are the classic units of heredity and play a crucial role in biological function. Identifying associations between genes and diseases is critical for diagnosis, prevention, prognosis, and drug development. Genes that encode proteins with similar sequences are often implicated in related diseases, as proteins causing identical or similar diseases tend to show limited variation in their sequences. Establishing gene-disease associations (GDAs) experimentally is time-consuming and expensive, because a large number of potential candidate genes must be tested. Although methods have been proposed to predict GDAs using traditional machine learning algorithms and graph neural networks, these approaches struggle to capture the deep semantic information within genes and diseases and are dependent on their training data. To alleviate these issues, we propose a novel GDA prediction model named FusionGDA, which uses a pre-training phase with a fusion module to enrich the gene and disease semantic representations encoded by pre-trained language models. The fusion module generates multi-modal representations that carry rich semantic information about two heterogeneous biomedical entities: protein sequences and disease descriptions. A pooling aggregation strategy is then adopted to compress the dimensions of the multi-modal representation. The pre-training phase leverages a contrastive learning loss to extract potential gene and disease features by training on a large public GDA dataset. To rigorously evaluate the effectiveness of the FusionGDA model, we conduct comprehensive experiments on five datasets and compare our proposed model with five competitive baseline models on the DisGeNet-Eval dataset. Notably, our case study further demonstrates the ability of FusionGDA to discover hidden associations effectively.
The complete code and datasets of our experiments are available at https://github.com/ZhaohanM/FusionGDA.
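The abstract names two mechanisms, pooling aggregation and a contrastive learning loss, without giving their exact form. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes mean pooling over token vectors and an InfoNCE-style loss with in-batch negatives and cosine similarity. The function names (`mean_pool`, `cosine`, `info_nce_loss`), the `temperature` value, and the toy vectors are all hypothetical, and the pre-trained language model encoders and the fusion module are not reproduced. Consult the linked repository for the actual method.

```python
import math


def mean_pool(token_vectors):
    """Compress a sequence of token vectors into one vector by averaging.

    Assumption: mean pooling; the abstract only says "pooling aggregation".
    """
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(gene_vecs, disease_vecs, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed choice of loss).

    gene_vecs[i] and disease_vecs[i] form a known associated pair;
    every other disease in the batch acts as a negative for gene i.
    """
    total = 0.0
    for i, gene in enumerate(gene_vecs):
        logits = [cosine(gene, d) / temperature for d in disease_vecs]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the true disease for this gene.
        total += log_denominator - logits[i]
    return total / len(gene_vecs)


# Toy token-level embeddings standing in for language model outputs.
gene_tokens = [
    [[1.0, 0.0], [0.8, 0.2]],  # gene 0
    [[0.0, 1.0], [0.2, 0.8]],  # gene 1
]
disease_tokens = [
    [[0.9, 0.1]],  # disease 0, associated with gene 0
    [[0.1, 0.9]],  # disease 1, associated with gene 1
]

genes = [mean_pool(tokens) for tokens in gene_tokens]
diseases = [mean_pool(tokens) for tokens in disease_tokens]

aligned_loss = info_nce_loss(genes, diseases)
shuffled_loss = info_nce_loss(genes, diseases[::-1])
```

With these toy vectors, the correctly paired batch gives a lower loss than the batch with the diseases reversed, which is the behavior a contrastive objective is meant to reward: associated gene and disease representations are pulled together while unassociated ones are pushed apart.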