Saxena Ankita, Nixon Bridgette, Boyd Amelia, Evans James, Faraone Stephen V
Department of Neuroscience and Physiology, State University of New York-Norton College of Medicine at Upstate Medical University, New York, USA.
Department of Psychiatry and Behavioral Sciences, State University of New York-Norton College of Medicine at Upstate Medical University, New York, USA.
Am J Med Genet B Neuropsychiatr Genet. 2025 Sep;198(6):3-18. doi: 10.1002/ajmg.b.33031. Epub 2025 May 2.
The development of high-throughput technologies has resulted in the collection of large quantities of genomic and transcriptomic data. However, identifying disease-associated genes or networks from these data remains a challenge. In recent years, graph neural networks (GNNs) have emerged as a promising analytical tool, but it is not well understood which characteristics of these models result in improved performance. We conducted a systematic search and review of publications that used GNNs to identify disease-associated biological interactions. We extracted information on model characteristics and performance in order to examine the relationship between them. Data leakage was found in 31% of these models. For node-level tasks, univariate positive associations were identified between model accuracy and the use of hyperparameter optimization, data leakage via hyperparameter optimization, test set size, and total dataset size. Among graph-level tasks, an increase in AUC was associated with testing method and a decrease in AUC with optimization reporting. Data leakage may pose an issue for GNN-based approaches; the adoption of best practice guidelines and consistent reporting of model design would be beneficial for future studies.