Yuan Qiuyue, Duren Zhana
Center for Human Genetics, Department of Genetics and Biochemistry, Clemson University, Greenwood, SC 29646, USA.
bioRxiv. 2023 Aug 3:2023.08.01.551575. doi: 10.1101/2023.08.01.551575.
Accurate inference of context-specific gene regulatory networks (GRNs) from genomics data is a crucial task in computational biology. However, existing methods face limitations, such as reliance on gene expression data alone, the lower resolution of bulk data, and data scarcity for specific cellular systems. Despite recent technological advances, including single-cell sequencing and the integration of ATAC-seq and RNA-seq data, learning such complex mechanisms from limited independent data points remains a daunting challenge that impedes GRN inference accuracy. To overcome this challenge, we present LINGER (LIfelong neural Network for GEne Regulation), a novel deep learning-based method that infers GRNs from single-cell multiome data, in which gene expression and chromatin accessibility are measured in the same cell. LINGER incorporates both 1) atlas-scale external bulk data across diverse cellular contexts and 2) knowledge of transcription factor (TF) motif matching to cis-regulatory elements as a manifold regularization, to address the challenge of limited data and an extensive parameter space in GRN inference. Our results demonstrate that LINGER achieves two- to threefold higher accuracy than existing methods. LINGER reveals a complex regulatory landscape of genome-wide association studies, enabling enhanced interpretation of disease-associated variants and genes. Additionally, once a GRN has been inferred from reference sc-multiome data, LINGER allows TF activity to be estimated solely from bulk or single-cell gene expression data, leveraging the abundance of available gene expression data to identify driver regulators from case-control studies. Overall, LINGER provides a comprehensive tool for robust gene regulation inference from genomics data, empowering deeper insights into cellular mechanisms.
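To make the term "manifold regularization" concrete, the following is a minimal, generic sketch of a graph-Laplacian penalty added to a squared-error loss. It is not the LINGER implementation: the function names, the linear model, the weighting parameter `lam`, and the idea of using a symmetric motif-matching score matrix as the `adjacency` input are all assumptions made for illustration only.

```python
import numpy as np


def laplacian_penalty(weights, adjacency):
    """Generic manifold (graph-Laplacian) penalty tr(W^T L W), with L = D - A.

    weights:   (n_regulators, n_targets) model weights, one row per regulator.
    adjacency: (n_regulators, n_regulators) symmetric, nonnegative similarity
               matrix (hypothetically, scores derived from motif matching).
    The value equals 0.5 * sum_ij A_ij * ||w_i - w_j||^2, so regulators that
    are strongly connected are pushed toward similar weights.
    """
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    return float(np.trace(weights.T @ laplacian @ weights))


def regularized_loss(x, y, weights, adjacency, lam=0.1):
    """Mean squared error of a linear model plus the Laplacian penalty.

    x: (n_cells, n_regulators) regulator features.
    y: (n_cells, n_targets) target gene expression.
    lam is a hypothetical weighting hyperparameter.
    """
    residual = x @ weights - y
    data_fit = float(np.mean(residual ** 2))
    return data_fit + lam * laplacian_penalty(weights, adjacency)
```

In this sketch the penalty is zero when connected regulators carry identical weights and grows with their squared difference, which is how a prior such as motif matching can constrain a model with many parameters when few independent data points are available.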