Wu Zijun, Sinha Saurabh
Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA.
H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30318, USA.
bioRxiv. 2023 Nov 13:2023.11.09.566399. doi: 10.1101/2023.11.09.566399.
Reconstruction of gene regulatory networks (GRNs) from expression data is a significant open problem. Common approaches train a machine learning (ML) model to predict a gene's expression using transcription factors' (TFs') expression as features and designate important features/TFs as regulators of the gene. Here, we present an entirely different paradigm, where GRN edges are directly predicted by the ML model. The new approach, named "SPREd", is a simulation-supervised neural network for GRN inference. Its inputs comprise expression relationships (e.g., correlation, mutual information) between the target gene and each TF and between pairs of TFs. The output includes binary labels indicating whether each TF regulates the target gene. We train the neural network model using synthetic expression data generated by a biophysics-inspired simulation model that incorporates linear as well as non-linear TF-gene relationships and diverse GRN configurations. We show SPREd to outperform state-of-the-art GRN reconstruction tools GENIE3, ENNET, PORTIA and TIGRESS on synthetic datasets with high co-expression among TFs, similar to that seen in real data. A key advantage of the new approach is its robustness to relatively small numbers of conditions (columns) in the expression matrix, which is a common problem faced by existing methods. Finally, we evaluate SPREd on real datasets in yeast that represent gold standard benchmarks of GRN reconstruction and show it to perform significantly better than or comparably to existing methods. In addition to its high accuracy and speed, SPREd marks a first step towards incorporating biophysics principles of gene regulation into ML-based approaches to GRN reconstruction.
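The abstract describes the network's inputs as expression relationships between the target gene and each TF and between pairs of TFs. As a rough illustration only, the sketch below assembles such a feature vector using Pearson correlation. The function name `pairwise_features`, the array layout (TFs as rows, conditions as columns), and the ordering of features are assumptions made for this example; they are not taken from the SPREd implementation, which may also use other statistics such as mutual information.

```python
import numpy as np


def pairwise_features(tf_expr, target_expr):
    """Build a correlation-based feature vector for one target gene.

    Hypothetical helper, not part of SPREd.
    tf_expr:     array of shape (n_tfs, n_conditions)
    target_expr: array of shape (n_conditions,)
    Returns the n_tfs target-TF correlations followed by the
    n_tfs * (n_tfs - 1) / 2 TF-TF correlations (upper triangle).
    """
    tf_expr = np.asarray(tf_expr, dtype=float)
    target_expr = np.asarray(target_expr, dtype=float)

    # Row 0 is the target gene, rows 1..n_tfs are the TFs.
    stacked = np.vstack([target_expr, tf_expr])
    corr = np.corrcoef(stacked)  # rows are treated as variables

    target_tf = corr[0, 1:]  # target gene vs. each TF
    tf_tf = corr[1:, 1:]     # TF vs. TF
    upper = np.triu_indices(tf_expr.shape[0], k=1)  # each TF pair once
    return np.concatenate([target_tf, tf_tf[upper]])


# Toy usage: 4 TFs measured under 20 conditions (random data, assumed).
rng = np.random.default_rng(0)
tfs = rng.normal(size=(4, 20))
target = 2.0 * tfs[0] + 1.0  # target depends linearly on the first TF
features = pairwise_features(tfs, target)
```

In this toy case the vector has 4 target-TF entries and 6 TF-TF entries. A supervised model in the spirit of the abstract would map such a vector to one binary label per TF; that step is not shown here because the abstract does not specify the architecture.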