MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing 100084, China.
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
Bioinformatics. 2018 Mar 1;34(5):732-738. doi: 10.1093/bioinformatics/btx679.
A majority of known genetic variants associated with inherited human diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole-genome level and to precisely decipher their implications. Although computational approaches have complemented high-throughput biological experiments in annotating the human genome, it remains a major challenge to accurately annotate regulatory elements in a specific cell type by automatically learning the DNA sequence code from large-scale sequencing data. An accurate and interpretable model that learns DNA sequence signatures and enables the identification of causative genetic variants has therefore become essential in both genomic and genetic studies.
We propose Deopen, a hybrid framework based mainly on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparisons with existing methods, we show the superior performance of our model not only in classifying accessible regions against randomly sampled background sequences, but also in the regression of DNase-seq signals. We further visualize the convolutional kernels and show that the identified sequence signatures match known motifs. Finally, we demonstrate the sensitivity of our model in finding causative non-coding variants in the analysis of a breast cancer dataset. We expect wide application of Deopen, with either public or in-house chromatin accessibility data, in the annotation of the human genome and the identification of non-coding variants associated with diseases.
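To make the general idea concrete, the following is a minimal sketch of how a single convolutional kernel can scan a one-hot encoded DNA sequence, and how the same scoring function can be used to compare a reference allele with a variant. It is not the Deopen architecture or its code: the kernel weights, the bias, the example sequence and all function names are hypothetical and chosen only for illustration, and a real model would learn many kernels from data rather than use a hand-written one.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).
    Bases outside ACGT (for example N) become all-zero rows."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]

def conv1d_relu(encoded, kernel):
    """Slide one kernel (width x 4 weights) along the sequence and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        activations.append(max(0.0, total))
    return activations

def accessibility_score(seq, kernel, bias=-2.0):
    """Toy score in (0, 1): global max pooling over the kernel activations,
    followed by a sigmoid. Assumes seq is at least as long as the kernel."""
    pooled = max(conv1d_relu(one_hot(seq), kernel))
    return 1.0 / (1.0 + math.exp(-(pooled + bias)))

def variant_effect(ref_seq, position, alt_base, kernel):
    """Score change caused by substituting alt_base at a 0-based position."""
    alt_seq = ref_seq[:position] + alt_base + ref_seq[position + 1:]
    return accessibility_score(alt_seq, kernel) - accessibility_score(ref_seq, kernel)

# Hypothetical hand-written kernel that rewards the 4-mer "TGAC"
# (columns are A, C, G, T); a trained model would learn such weights.
MOTIF_KERNEL = [
    [-1.0, -1.0, -1.0, 1.0],   # position 1 prefers T
    [-1.0, -1.0, 1.0, -1.0],   # position 2 prefers G
    [1.0, -1.0, -1.0, -1.0],   # position 3 prefers A
    [-1.0, 1.0, -1.0, -1.0],   # position 4 prefers C
]

if __name__ == "__main__":
    ref = "AATTGACGGA"  # made-up sequence containing TGAC at positions 3-6
    print("reference score:", accessibility_score(ref, MOTIF_KERNEL))
    print("effect of G>C at position 4:", variant_effect(ref, 4, "C", MOTIF_KERNEL))
```

In this toy setting, a substitution that disrupts the motif lowers the score, which is the intuition behind using a sequence model to prioritize non-coding variants. Visualizing a kernel as described above amounts to inspecting weights such as those in `MOTIF_KERNEL` and comparing them with known motifs.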
Deopen is freely available at https://github.com/kimmo1019/Deopen.
Supplementary data are available at Bioinformatics online.