Nettling Martin, Treutler Hendrik, Cerquides Jesus, Grosse Ivo
Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany.
Leibniz Institute of Plant Biochemistry, Halle, Germany.
BMC Bioinformatics. 2017 Mar 1;18(1):141. doi: 10.1186/s12859-017-1495-1.
Transcriptional gene regulation is a fundamental process in nature, and the experimental and computational investigation of DNA binding motifs and their binding sites is a prerequisite for elucidating this process. Approaches for de-novo motif discovery can be subdivided into phylogenetic footprinting, which takes into account phylogenetic dependencies in aligned sequences of more than one species, and non-phylogenetic approaches, which are based on sequences from only one species and typically take into account intra-motif dependencies. It has been shown that modeling (i) phylogenetic dependencies and (ii) intra-motif dependencies separately each improves de-novo motif discovery, but there is no approach capable of modeling both (i) and (ii) simultaneously.
Here, we present an approach for de-novo motif discovery that combines phylogenetic footprinting with motif models capable of taking into account intra-motif dependencies. We study the degree of intra-motif dependencies inferred by this approach from ChIP-seq data of 35 transcription factors. We find that significant intra-motif dependencies of orders 1 and 2 are present in all 35 datasets and that intra-motif dependencies of order 2 are typically stronger than those of order 1. We also find that the presented approach improves the classification performance of phylogenetic footprinting in all 35 datasets and that incorporating intra-motif dependencies of order 2 yields a higher classification performance than incorporating such dependencies of only order 1.
Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies leads to an improved performance in the classification of transcription factor binding sites. This may advance our understanding of transcriptional gene regulation and its evolution.
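To make the notion of intra-motif dependencies of order 0, 1, and 2 concrete, the following is a minimal sketch of a position-specific Markov motif model. It is an illustration only and not the method of the paper: it ignores phylogenetic dependencies entirely, and the names `fit_motif_model` and `log_likelihood`, the pseudocount, and the toy binding sites are assumptions made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def fit_motif_model(sites, order=1, pseudocount=1.0):
    """Estimate P(base at position j | up to `order` preceding bases).

    `sites` is a list of equal-length binding-site strings over ACGT.
    order=0 corresponds to a position weight matrix (no dependencies).
    """
    width = len(sites[0])
    # counts[j][context][base] = number of sites with this base at position j
    # preceded by this context (the context is shorter near the left edge).
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(width)]
    for site in sites:
        for j, base in enumerate(site):
            context = site[max(0, j - order):j]
            counts[j][context][base] += 1.0

    tables = []
    for j in range(width):
        table = {}
        for context, base_counts in counts[j].items():
            total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
            table[context] = {
                b: (base_counts.get(b, 0.0) + pseudocount) / total
                for b in ALPHABET
            }
        tables.append(table)
    return {"order": order, "tables": tables}


def log_likelihood(model, site):
    """Log-probability of `site` under the fitted model.

    Contexts never seen in training fall back to a uniform distribution
    (an assumption of this sketch).
    """
    order = model["order"]
    uniform = 1.0 / len(ALPHABET)
    total = 0.0
    for j, base in enumerate(site):
        context = site[max(0, j - order):j]
        probs = model["tables"][j].get(context)
        total += math.log(probs[base] if probs else uniform)
    return total


# Toy sites (hypothetical): positions 1 and 2 are either "AC" or "GG",
# so the two positions are statistically dependent.
toy_sites = ["AACC", "AGGC", "AACC", "AGGC"]
pwm = fit_motif_model(toy_sites, order=0)
markov1 = fit_motif_model(toy_sites, order=1)

observed, shuffled = "AACC", "AAGC"  # "AAGC" mixes the two patterns
print(log_likelihood(pwm, observed), log_likelihood(pwm, shuffled))
print(log_likelihood(markov1, observed), log_likelihood(markov1, shuffled))
```

In this toy setting the order-0 model cannot tell the observed site from the mixed one, because both have the same per-position base frequencies, whereas the order-1 model assigns the observed site a higher likelihood. This is the kind of signal that motif models with intra-motif dependencies can exploit.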