Gene expression inference based on graph neural networks using L1000 data.

Author information

Kim Tae Hyun, Kim Harim, Hwang Hyunjin, Kang Shinwhan, Shin Kijung, Baek Inwha

Affiliations

Department of Regulatory Science, Graduate School, Kyung Hee University, 26 Kyungheedae-ro, Dongdaemun District, Seoul 02447, South Korea.

College of Pharmacy, Kyung Hee University, 26 Kyungheedae-ro, Dongdaemun District, Seoul 02447, South Korea.

Publication information

Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf273.

DOI:10.1093/bib/bbaf273

PMID:40505083

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12161499/

Abstract

Gene expression profiles can serve as proxies for cellular states and provide valuable insights into functional connections across diverse cellular contexts. L1000, a cost-effective method, has been developed to generate gene expression profiles for over a million different conditions. Because gene expression inference in L1000 relies on linear regression, nonlinear regression methods, including deep learning models, have also been assessed. However, these approaches treat gene expression data as a vector, which motivated us to investigate whether nonlinear models based on a graph structure capture the relationships between genes underlying gene expression profiles more effectively. In this work, we show that a graph neural network (GNN) model with genes as nodes outperforms both linear and nonlinear non-GNN models in predicting gene expression values and expression-based gene rankings. Importantly, our GNN model requires ~10-fold less information than other models to achieve comparable performance. Strategically selecting input features, or incorporating a feature for the organ from which the gene expression data are derived, further improves the gene expression inference performance of the GNN model. Additionally, we evaluate the cross-platform generality of gene expression inference. Our study demonstrates that transforming RNA expression data into a graph structure effectively captures nonlinear correlations between genes, thereby enabling highly accurate and efficient prediction of gene expression profiles.
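For readers unfamiliar with the linear baseline that the abstract refers to, the following is a minimal sketch of regression-based expression inference: a set of measured genes is used to predict unmeasured target genes through an ordinary least-squares fit. It is not the authors' code or the L1000 pipeline. The function names (`fit_linear_inference`, `infer_targets`), the matrix sizes, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_linear_inference(measured, targets):
    """Fit targets = [measured, 1] @ W by ordinary least squares.

    measured: (n_profiles, n_measured) expression of the measured genes
    targets:  (n_profiles, n_targets) expression of the genes to infer
    Returns W with shape (n_measured + 1, n_targets); the last row is the bias.
    """
    X = np.hstack([measured, np.ones((measured.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return W


def infer_targets(measured, W):
    """Predict target-gene expression from measured-gene expression."""
    X = np.hstack([measured, np.ones((measured.shape[0], 1))])
    return X @ W


# Synthetic data (hypothetical sizes, not real L1000 profiles).
rng = np.random.default_rng(0)
n_profiles, n_measured, n_targets = 200, 20, 5
true_W = rng.normal(size=(n_measured, n_targets))
measured = rng.normal(size=(n_profiles, n_measured))
targets = measured @ true_W + 0.01 * rng.normal(size=(n_profiles, n_targets))

# Fit on the first 150 profiles, evaluate on the remaining 50.
W = fit_linear_inference(measured[:150], targets[:150])
pred = infer_targets(measured[150:], W)
mae = float(np.mean(np.abs(pred - targets[150:])))
print(f"held-out mean absolute error: {mae:.4f}")
```

Because the synthetic targets are generated from a linear map, the linear fit recovers them almost exactly here. The abstract's argument is that real gene-gene relationships are nonlinear, which is where vector-based deep models and the graph-based model are reported to do better.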
