Ozaki Haruka, Iwasaki Wataru
Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwanoha 5-1-5, Kashiwa, 277-8568 Chiba, Japan.
Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwanoha 5-1-5, Kashiwa, 277-8568 Chiba, Japan; Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, 113-0032 Tokyo, Japan; Atmosphere and Ocean Research Institute, The University of Tokyo, Kashiwanoha 5-1-5, Kashiwa, 277-8564 Chiba, Japan.
Comput Biol Chem. 2016 Aug;63:62-72. doi: 10.1016/j.compbiolchem.2016.01.014. Epub 2016 Feb 13.
As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns called DNA-binding motifs. A single TF can tolerate ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarifying this DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-Seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods cannot directly answer whether a given TF can bind to a specific DNA-binding motif.
Here, we report MOCCS, a method for clarifying DNA-binding motif ambiguity. Given ChIP-Seq data for any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets showed that MOCCS is applicable to various ChIP-Seq datasets and requires only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets showed that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even when known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models, which are typically unavailable. MOCCS is implemented in Perl and R and is freely available at https://github.com/yuifu/moccs.
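To make the idea of describing every k-mer in a set of ChIP-Seq sequences concrete, the following is a minimal sketch in Python. It is not the MOCCS algorithm, whose scoring is not described in this abstract. It assumes, for illustration only, that each input sequence is a window centered on a ChIP-Seq peak position, and it summarizes each k-mer by how often it occurs and how far, on average, it lies from the window center. The function names `kmer_center_offsets` and `mean_offset_table` are hypothetical, and reverse-complement strands are not merged.

```python
from collections import defaultdict


def kmer_center_offsets(sequences, k):
    """Map each k-mer to the absolute distances between its midpoint
    and the midpoint of the sequence it occurs in.

    Assumption (not from the source): every sequence is a window
    centered on a ChIP-Seq peak position.
    """
    offsets = defaultdict(list)
    for seq in sequences:
        seq = seq.upper()
        center = (len(seq) - 1) / 2
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                kmer_mid = i + (k - 1) / 2
                offsets[kmer].append(abs(kmer_mid - center))
    return offsets


def mean_offset_table(sequences, k, min_count=1):
    """Return (k-mer, count, mean distance to center) rows,
    sorted so that the most centrally located k-mers come first."""
    offsets = kmer_center_offsets(sequences, k)
    rows = [
        (kmer, len(dists), sum(dists) / len(dists))
        for kmer, dists in offsets.items()
        if len(dists) >= min_count
    ]
    rows.sort(key=lambda row: (row[2], row[0]))
    return rows


if __name__ == "__main__":
    # Toy input: a single 9-bp window with "CGT" at its exact center.
    for kmer, count, mean_dist in mean_offset_table(["AAACGTAAA"], k=3):
        print(kmer, count, mean_dist)
```

In this toy summary, a k-mer that a TF actually binds would be expected to cluster near the window center, while unrelated k-mers would be spread more evenly. Real data would need many sequences and a minimum-count filter (`min_count`) to make the averages meaningful.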
By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions.