Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Proc Natl Acad Sci U S A. 2020 Oct 13;117(41):25655-25666. doi: 10.1073/pnas.2011795117. Epub 2020 Sep 25.
Although we know many sequence-specific transcription factors (TFs), how the DNA sequence of cis-regulatory elements is decoded and orchestrated on the genome scale to determine immune cell differentiation is beyond our grasp. Leveraging a granular atlas of chromatin accessibility across 81 immune cell types, we asked whether a convolutional neural network (CNN) could learn to infer cell type-specific chromatin accessibility solely from regulatory DNA sequences. With a tailored architecture and an ensemble approach to CNN parameter interpretation, we show that our trained network ("AI-TAC") does so by rediscovering ab initio the binding motifs for known regulators and some unknown ones. Motifs that the network learns in silico to be functionally important overlap strikingly well with binding positions determined by chromatin immunoprecipitation for several TFs. AI-TAC establishes a hierarchy of TFs and their interactions that drives lineage specification and also identifies stage-specific interactions, like Pax5/Ebf1 vs. Pax5/Prdm1, or the role of different NF-κB dimers in different cell types. AI-TAC identifies Spi1/Cebp and Pax5/Ebf1 as the drivers necessary for myeloid and B lineage fates, respectively, but no factors seem as dominantly required for T cell differentiation, which may represent a fall-back pathway. Mouse-trained AI-TAC can parse human DNA, revealing a strikingly similar ranking of influential TFs and providing additional support that AI-TAC is a generalizable regulatory sequence decoder. Thus, deep learning can reveal the regulatory syntax predictive of the full differentiative complexity of the immune system.
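To make concrete the idea of a convolutional filter acting as a motif detector on DNA, the following is a minimal illustrative sketch and not the AI-TAC implementation. It one-hot encodes a sequence and slides a single filter along it, which is in general what a first convolutional layer computes on sequence input. The function names (`one_hot`, `scan_filter`, `best_match`) and the toy filter weights are hypothetical, chosen only for illustration; a trained network would learn many such filters from data.

```python
# Illustrative sketch only; names and weights are hypothetical, not taken from AI-TAC.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def scan_filter(seq, weights):
    """Slide a (width x 4) filter along the sequence without padding.

    Returns one score per window, i.e. a 1D convolution over the one-hot input.
    """
    encoded = one_hot(seq)
    width = len(weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * weights[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores


def best_match(seq, weights):
    """Return (position, score) of the highest-scoring window, or None if the
    sequence is shorter than the filter."""
    scores = scan_filter(seq, weights)
    if not scores:
        return None
    pos = max(range(len(scores)), key=scores.__getitem__)
    return pos, scores[pos]


# Hypothetical toy filter that rewards the 4-mer "GGAA" (made up for illustration).
TOY_FILTER = [
    [-1.0, -1.0, 1.0, -1.0],  # position 1 prefers G
    [-1.0, -1.0, 1.0, -1.0],  # position 2 prefers G
    [1.0, -1.0, -1.0, -1.0],  # position 3 prefers A
    [1.0, -1.0, -1.0, -1.0],  # position 4 prefers A
]

if __name__ == "__main__":
    print(best_match("TTGGAATT", TOY_FILTER))
```

In this toy setting the highest-scoring window marks where the sequence best matches the filter's preferred bases. Reading learned filter weights in this way is one common route to interpreting such filters as binding motifs.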