

Learning and interpreting the gene regulatory grammar in a deep learning framework.

Affiliations

Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America.

Vanderbilt Genetics Institute and Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America.

Publication information

PLoS Comput Biol. 2020 Nov 2;16(11):e1008334. doi: 10.1371/journal.pcbi.1008334. eCollection 2020 Nov.

Abstract

Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors (TFs), i.e., the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and that the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBSs lie outside of the regulatory grammar. However, we also identified common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depend on the nature of the prediction task.
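To make the idea of a synthetic regulatory grammar concrete, the following is a minimal sketch of how sequences containing homotypic or heterotypic TFBS clusters might be simulated. It is not the authors' pipeline: the abstract states that real TF binding motifs were used, whereas the two motif strings, the names `TF_A` and `TF_B`, the function `embed_cluster`, and the 200 bp sequence length here are all assumptions chosen for illustration. Enhanceosome-style constraints on site order and spacing are not modelled.

```python
import random

BASES = "ACGT"

# Hypothetical consensus strings standing in for real TF binding motifs
# (assumption: the paper samples sites from real motif models instead).
MOTIFS = {"TF_A": "TGACTCA", "TF_B": "CACGTG"}


def embed_cluster(length, motif_names, rng, max_tries=1000):
    """Embed one site per entry of motif_names into a random background.

    Sites are placed at random, non-overlapping positions.
    Returns the sequence and a list of (motif_name, start) tuples.
    """
    seq = [rng.choice(BASES) for _ in range(length)]
    occupied = [False] * length
    sites = []
    for name in motif_names:
        motif = MOTIFS[name]
        for _ in range(max_tries):
            start = rng.randrange(0, length - len(motif) + 1)
            end = start + len(motif)
            if not any(occupied[start:end]):
                break
        else:
            raise ValueError("could not place motif without overlap")
        seq[start:end] = list(motif)
        occupied[start:end] = [True] * len(motif)
        sites.append((name, start))
    return "".join(seq), sites


rng = random.Random(0)
# Homotypic cluster: repeated sites for a single TF.
homotypic_seq, homotypic_sites = embed_cluster(200, ["TF_A"] * 3, rng)
# Heterotypic cluster: sites for different TFs in the same sequence.
heterotypic_seq, heterotypic_sites = embed_cluster(200, ["TF_A", "TF_B"], rng)
```

Each returned sequence could then be labelled by the cluster type it contains, giving a toy version of the multi-label prediction task described above.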


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7588/7660921/713e334f58ed/pcbi.1008334.g001.jpg
