Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, D-06484 Quedlinburg, Germany
Institute of Computer Science, Martin Luther University Halle-Wittenberg, D-06099 Halle (Saale), Germany.
Nucleic Acids Res. 2015 Oct 15;43(18):e119. doi: 10.1093/nar/gkv577. Epub 2015 Jun 26.
Binding of transcription factors to DNA is one of the keystones of gene regulation. The existence of statistical dependencies between binding site positions is widely accepted, while their relevance for computational predictions has been debated. Building probabilistic models of binding sites that may capture dependencies is still challenging, since the most successful motif discovery approaches require numerical optimization techniques, which are not suited for selecting dependency structures. To overcome this issue, we propose sparse local inhomogeneous mixture (Slim) models that combine putative dependency structures in a weighted manner allowing for numerical optimization of dependency structure and model parameters simultaneously. We find that Slim models yield a substantially better prediction performance than previous models on genomic context protein binding microarray data sets and on ChIP-seq data sets. To elucidate the reasons for the improved performance, we develop dependency logos, which allow for visual inspection of dependency structures within binding sites. We find that the dependency structures discovered by Slim models are highly diverse and highly transcription factor-specific, which emphasizes the need for flexible dependency models. The observed dependency structures range from broad heterogeneities to sparse dependencies between neighboring and non-neighboring binding site positions.
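As a rough illustration of what "combining putative dependency structures in a weighted manner" could look like, the following minimal sketch scores a binding site under a toy mixture. At each position the base is emitted either independently of its context or conditional on one preceding base at distance d, and continuous weights blend these candidate structures. The function names, the parameter layout (`base_probs`, `cond_probs`, `weights`) and the renormalization at the first positions are assumptions made for this example; this is not the authors' implementation or parameterization.

```python
import itertools
import math
import random

ALPHABET = "ACGT"


def site_log_likelihood(seq, base_probs, cond_probs, weights):
    """Log-likelihood of one site under a toy weighted mixture of dependency structures.

    Hypothetical parameter layout (an assumption of this sketch):
      base_probs[pos][child]               context-free distribution at pos
      cond_probs[pos][d-1][parent][child]  distribution given the base d positions upstream
      weights[pos][k]                      k = 0: no dependency, k = d: depend on pos - d
    """
    x = [ALPHABET.index(c) for c in seq]
    total = 0.0
    for pos, child in enumerate(x):
        # Near the start of the site, fewer preceding positions are available.
        n_parents = min(pos, len(cond_probs[pos]))
        w = weights[pos][: n_parents + 1]
        norm = sum(w)  # renormalize over the components that exist here
        p = w[0] * base_probs[pos][child]
        for d in range(1, n_parents + 1):
            parent = x[pos - d]
            p += w[d] * cond_probs[pos][d - 1][parent][child]
        total += math.log(p / norm)
    return total


def random_distribution(rng, n):
    """Draw a strictly positive probability vector of length n."""
    draws = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(draws)
    return [v / s for v in draws]


def random_parameters(length, max_dist, seed=0):
    """Random toy parameters for a site of the given length (illustration only)."""
    rng = random.Random(seed)
    base = [random_distribution(rng, 4) for _ in range(length)]
    cond = [
        [[random_distribution(rng, 4) for _ in range(4)] for _ in range(max_dist)]
        for _ in range(length)
    ]
    weights = [random_distribution(rng, max_dist + 1) for _ in range(length)]
    return base, cond, weights


if __name__ == "__main__":
    base, cond, weights = random_parameters(length=3, max_dist=2, seed=1)
    print(site_log_likelihood("ACG", base, cond, weights))
    # Sanity check: the probabilities of all 4^3 possible sites should sum to 1.
    print(sum(
        math.exp(site_log_likelihood("".join(s), base, cond, weights))
        for s in itertools.product(ALPHABET, repeat=3)
    ))
```

Because the weights are continuous, a gradient-based optimizer could in principle adjust them together with the probability tables. A weight vector concentrated on a single component would then correspond to a sparse dependency structure at that position.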