Kommu Sindhura, Wang Yizhi, Wang Yue, Wang Xuan
Department of Computer Science, Virginia Tech, Blacksburg, 24061, Virginia, USA.
Department of Electrical and Computer Engineering, Virginia Tech, Arlington, 22203, Virginia, USA.
bioRxiv. 2025 Jan 29:2024.12.16.628715. doi: 10.1101/2024.12.16.628715.
Single-cell RNA sequencing (scRNA-seq) data offer unprecedented opportunities to infer gene regulatory networks (GRNs) at fine-grained resolution, shedding light on cellular phenotypes at the molecular level. However, the high sparsity, noise, and dropout events inherent in scRNA-seq data pose significant challenges for accurate and reliable GRN inference. The rapid growth of experimentally validated transcription factor (TF)-DNA binding data (e.g., ChIP-seq) has enabled supervised machine learning methods, which learn patterns from known gene regulatory interactions and achieve high accuracy by framing GRN inference as a gene regulatory link prediction task. This study addresses the gene regulatory link prediction problem by learning informative vectorized representations at the gene level to predict missing regulatory interactions. However, supervised methods need a large amount of known TF-DNA binding data to perform well, and such data are experimentally expensive to obtain and therefore limited. Advances in large-scale pre-training and transfer learning offer a way to address this limitation. We leverage single-cell foundation models (scFMs), which are large models pre-trained on extensive scRNA-seq datasets, and combine them with joint graph-based learning to establish a robust foundation for gene regulatory link prediction.
We propose scRegNet, a novel and effective framework that combines scFMs with joint graph-based learning for gene regulatory link prediction. scRegNet achieves state-of-the-art results against nine baseline methods on seven scRNA-seq benchmark datasets. In addition, scRegNet is more robust than the baseline methods when the training data are noisy.
The source code is available at https://github.com/sindhura-cs/scRegNet.
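To make the link prediction framing concrete, the following is a minimal sketch and is not taken from the scRegNet repository. It assumes each gene already has two embeddings, one standing in for an scFM output and one for a graph encoder output, concatenates them into a joint gene representation, and scores (TF, target) pairs with a plain logistic regression. The embeddings and labels are random placeholders, and names such as `pair_features` and `train_link_predictor` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in practice these would come from a pre-trained
# single-cell foundation model and from a graph encoder, respectively.
n_genes = 30
scfm_emb = rng.normal(size=(n_genes, 8))
graph_emb = rng.normal(size=(n_genes, 4))

# Joint gene representation: concatenate the two views of each gene.
gene_emb = np.concatenate([scfm_emb, graph_emb], axis=1)


def pair_features(emb, pairs):
    """Concatenate the TF and target embeddings for each (tf, target) index pair."""
    pairs = np.asarray(pairs)
    return np.concatenate([emb[pairs[:, 0]], emb[pairs[:, 1]]], axis=1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(X, y, w, b):
    """Mean binary cross-entropy of the logistic model (w, b) on (X, y)."""
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def train_link_predictor(X, y, lr=0.1, epochs=300):
    """Fit a logistic regression by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


# Placeholder training set: random (TF index, target index) pairs with
# random 0/1 labels. Real labels would come from known TF-DNA binding data.
train_pairs = rng.integers(0, n_genes, size=(60, 2))
train_labels = rng.integers(0, 2, size=60).astype(float)

X_train = pair_features(gene_emb, train_pairs)
w, b = train_link_predictor(X_train, train_labels)

# Score unseen candidate links; each score is a probability-like value.
candidate_pairs = [(0, 5), (3, 7)]
scores = sigmoid(pair_features(gene_emb, candidate_pairs) @ w + b)
```

Because the labels here are random, the scores carry no biological meaning. The sketch only shows the shape of the task: gene-level vectors go in, and a score per candidate regulatory link comes out.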