Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Nucleic Acids Res. 2021 Oct 11;49(18):10309-10327. doi: 10.1093/nar/gkab765.
Deciphering the sequence-function relationship encoded in enhancers holds the key to interpreting non-coding variants and understanding mechanisms of transcriptomic variation. Several quantitative models exist for predicting enhancer function and underlying mechanisms; however, there has been no systematic comparison of these models characterizing their relative strengths and shortcomings. Here, we interrogated a rich data set of neuroectodermal enhancers in Drosophila, representing cis- and trans- sources of expression variation, with a suite of biophysical and machine learning models. We performed rigorous comparisons of thermodynamics-based models implementing different mechanisms of activation, repression and cooperativity. Moreover, we developed a convolutional neural network (CNN) model, called CoNSEPT, that learns enhancer 'grammar' in an unbiased manner. CoNSEPT is the first general-purpose CNN tool for predicting enhancer function in varying conditions, such as different cell types and experimental conditions, and we show that such complex models can suggest interpretable mechanisms. We found model-based evidence for mechanisms previously established for the studied system, including cooperative activation and short-range repression. The data also favored one hypothesized activation mechanism over another and suggested an intriguing role for a direct, distance-independent repression mechanism. Our modeling shows that while fundamentally different models can yield similar fits to data, they vary in their utility for mechanistic inference. CoNSEPT is freely available at: https://github.com/PayamDiba/CoNSEPT.
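As a rough illustration of what "cooperativity" means in a thermodynamics-based model, the sketch below computes equilibrium state probabilities for two identical transcription factor binding sites. It is a generic textbook-style partition function and not the model used in this study or in CoNSEPT. The function names, the single-site statistical weight `q` (binding constant times factor concentration), and the cooperativity factor `omega` are assumptions introduced here for illustration only.

```python
# Hypothetical two-site equilibrium binding sketch (not the paper's model).
# q     : assumed statistical weight of one bound site, K * [TF]
# omega : assumed cooperativity factor; omega = 1 means independent sites


def two_site_occupancy(q: float, omega: float) -> dict:
    """Return Boltzmann-weighted probabilities of the three occupancy states."""
    # Partition function: empty + two singly bound states + doubly bound state.
    z = 1.0 + 2.0 * q + omega * q * q
    return {
        "empty": 1.0 / z,
        "single": 2.0 * q / z,
        "double": omega * q * q / z,
    }


def mean_bound_sites(q: float, omega: float) -> float:
    """Expected number of occupied sites under the same assumptions."""
    p = two_site_occupancy(q, omega)
    return p["single"] + 2.0 * p["double"]


if __name__ == "__main__":
    # Compare independent binding (omega = 1) with cooperative binding (omega = 10).
    for omega in (1.0, 10.0):
        print(omega, mean_bound_sites(1.0, omega))
```

Under these assumptions, raising `omega` above 1 shifts probability toward the doubly bound state at a fixed factor concentration, which is the qualitative signature that a cooperative activation mechanism adds to such a model.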