Ravichandran Prashanthi, Parsana Princy, Keener Rebecca, Hansen Kaspar D, Battle Alexis
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
bioRxiv. 2024 Jan 23:2024.01.20.576447. doi: 10.1101/2024.01.20.576447.
Gene co-expression networks (GCNs) describe relationships among expressed genes key to maintaining cellular identity and homeostasis. However, the sample size of a typical RNA-seq experiment, which is several orders of magnitude smaller than the number of genes, is too low to infer GCNs reliably. A publicly available dataset comprising 316,443 uniformly processed human RNA-seq samples provides an opportunity to improve power for accurate network reconstruction and to obtain biological insight from the resulting networks.
We compared alternative aggregation strategies to identify an optimal workflow for GCN inference by data aggregation and inferred three consensus networks (a universal network, a non-cancer network, and a cancer network) in addition to 27 tissue context-specific networks. Central network genes from our consensus networks were enriched for evolutionarily constrained genes and ubiquitous biological pathways, whereas central context-specific network genes included tissue-specific transcription factors, and factorization based on the hubs led to clustering of related tissue contexts. We discovered that annotations corresponding to context-specific networks inferred from aggregated data were enriched for trait heritability beyond known functional genomic annotations and were significantly more enriched when we aggregated over a larger number of samples.
This study outlines best practices for GCN inference and evaluation by data aggregation. We recommend estimating and regressing out confounders in each data set before aggregation and prioritizing studies with large sample sizes for GCN reconstruction. Increased statistical power in inferring context-specific networks enabled the derivation of variant annotations that were enriched for concordant trait heritability independent of context-agnostic functional genomic annotations. While we observed strictly increasing held-out log-likelihood with data aggregation, we noted diminishing marginal improvements. Future work on alternative methods for estimating confounders and on integrating orthogonal information from modalities such as Hi-C and ChIP-seq could further improve GCN inference.
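The recommended order of operations (correct each data set for confounders first, then aggregate, then evaluate on held-out samples) can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration rather than the authors' pipeline: the function names are hypothetical, confounders are approximated by removing the top principal components of each study, aggregation is a sample-size-weighted average of per-study covariances, and a simple shrinkage estimator stands in for whatever network estimator the study actually used.

```python
import numpy as np

def residualize_top_pcs(expr, n_pcs):
    """Center each gene, then remove the top n_pcs principal components.

    expr is a samples x genes matrix from ONE study. Treating leading PCs
    as confounder proxies is an assumption of this sketch.
    """
    centered = expr - expr.mean(axis=0, keepdims=True)
    if n_pcs == 0:
        return centered
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    low_rank = (u[:, :n_pcs] * s[:n_pcs]) @ vt[:n_pcs]
    return centered - low_rank

def aggregate_covariance(studies, n_pcs=1):
    """Sample-size-weighted average of per-study residual covariances."""
    total_samples = sum(study.shape[0] for study in studies)
    n_genes = studies[0].shape[1]
    cov = np.zeros((n_genes, n_genes))
    for study in studies:
        resid = residualize_top_pcs(study, n_pcs)
        cov += resid.T @ resid  # equals n_i times the study covariance
    return cov / total_samples

def shrunk_precision(cov, shrink=0.1):
    """Invert a covariance shrunk toward a scaled identity.

    A stand-in for a sparse network estimator; shrinkage keeps the
    matrix invertible after principal components have been removed.
    """
    n_genes = cov.shape[0]
    target = np.eye(n_genes) * np.trace(cov) / n_genes
    return np.linalg.inv((1.0 - shrink) * cov + shrink * target)

def heldout_loglik(precision, test_resid):
    """Average per-sample Gaussian log-likelihood of held-out residuals."""
    n_genes = precision.shape[0]
    emp_cov = test_resid.T @ test_resid / test_resid.shape[0]
    _, logdet = np.linalg.slogdet(precision)
    return 0.5 * (logdet - np.trace(emp_cov @ precision)
                  - n_genes * np.log(2.0 * np.pi))
```

To compare aggregation levels under these assumptions, one would call `aggregate_covariance` on progressively larger lists of studies, pass each result through `shrunk_precision`, and score the same residualized held-out study with `heldout_loglik`.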