Bhogale Shounak, Seward Chris, Stubbs Lisa, Sinha Saurabh
Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Pacific Northwest Research Institute, Seattle, WA 98122.
bioRxiv. 2023 Nov 13:2023.11.09.565900. doi: 10.1101/2023.11.09.565900.
A common way to investigate gene regulatory mechanisms is to identify differentially expressed genes using transcriptomics, find their candidate enhancers using epigenomics, and search for over-represented transcription factor (TF) motifs in these enhancers using bioinformatics tools. A related follow-up task is to model gene expression as a function of enhancer sequences and rank TF motifs by their contribution to such models, thus prioritizing among regulators. We present a new computational tool called SEAMoD that performs the above tasks of motif finding and sequence-to-expression modeling simultaneously. It trains a convolutional neural network model to relate enhancer sequences to differential expression in one or more biological conditions. The model uses TF motifs to interpret the sequences, learning these motifs and their relative importance to each biological condition from data. It also utilizes epigenomic information in the form of activity scores of putative enhancers and automatically searches for the most promising enhancer for each gene. Compared to existing neural network models of non-coding sequences, SEAMoD uses far fewer parameters, requires far less training data, and emphasizes biological interpretability. We used SEAMoD to understand regulatory mechanisms underlying the differentiation of neural stem cells (NSCs) derived from mouse forebrain. We profiled gene expression and histone modifications in NSCs and three differentiated cell types and used SEAMoD to model differential expression of nearly 12,000 genes with an accuracy of 81%, in the process identifying Olig2, the E2f family TFs, Foxo3, and Tcf4 as key transcriptional regulators of the differentiation process.
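The modeling idea summarized in the abstract (motif filters scanning enhancer sequences, condition-specific motif weights, and a choice among candidate enhancers per gene) can be illustrated with a small forward-pass sketch. This is not SEAMoD's implementation. The function names, the max-pooling over positions, the sigmoid output, and the rule of keeping the highest-scoring enhancer per condition are all assumptions made for illustration, and the epigenomic activity scores mentioned in the abstract are left out.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) array; unknown bases (e.g. N) stay all-zero."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def motif_scores(x, motifs):
    """Scan K motif filters of width W over one sequence and max-pool over positions.

    x: (L, 4) one-hot sequence, assumed to have L >= W.
    motifs: (K, W, 4) filter weights (PWM-like; hypothetical parameters).
    Returns a (K,) vector holding the best match score of each motif.
    """
    _, width, _ = motifs.shape
    n_pos = x.shape[0] - width + 1
    windows = np.stack([x[i:i + width] for i in range(n_pos)])  # (P, W, 4)
    scores = np.einsum("pwb,kwb->pk", windows, motifs)          # (P, K)
    return scores.max(axis=0)

def predict_gene(enhancers, motifs, weights, bias):
    """Score every candidate enhancer of one gene and keep the best per condition.

    weights: (K, C) motif importance per condition; bias: (C,).
    Returns (probs, best): probs is (C,) sigmoid outputs, best is (C,) indices
    of the enhancer selected for each condition (an assumed selection rule).
    """
    feats = np.stack([motif_scores(one_hot(s), motifs) for s in enhancers])  # (E, K)
    logits = feats @ weights + bias                                          # (E, C)
    best = logits.argmax(axis=0)
    probs = 1.0 / (1.0 + np.exp(-logits.max(axis=0)))
    return probs, best

# Toy example: one motif that matches "ACGT" exactly, one condition.
toy_motifs = one_hot("ACGT")[None, :, :]   # (1, 4, 4)
toy_weights = np.array([[1.0]])            # (K=1, C=1)
toy_bias = np.array([-2.0])
toy_probs, toy_best = predict_gene(
    ["TTTTACGTTT", "GGGGGGGGGG"], toy_motifs, toy_weights, toy_bias
)
```

In a trained model of this kind, the columns of `weights` would give a per-condition ranking of motifs, which is the sort of interpretability the abstract emphasizes. Here the values are fixed by hand purely to show the data flow.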