Klepper Kjetil, Sandve Geir K, Abul Osman, Johansen Jostein, Drablos Finn
Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway.
BMC Bioinformatics. 2008 Feb 26;9:123. doi: 10.1186/1471-2105-9-123.
Computational discovery of regulatory elements is an important area of bioinformatics research, and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery, that is, discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so-called composite motifs or cis-regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery.
We have developed a benchmarking framework for composite motif discovery and used it to evaluate the performance of eight published module discovery tools. Benchmark datasets were constructed from real genomic sequences containing experimentally verified regulatory modules, and the module discovery programs were asked both to predict the locations of these modules and to specify the single motifs involved. To aid the programs in their search, we provided position weight matrices corresponding to the binding motifs of the transcription factors involved. In addition, selections of decoy matrices were mixed with the genuine matrices on one dataset to test the response of programs to varying levels of noise.
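As an illustration of how predictions of this kind might be compared against annotated modules, the following minimal Python sketch scores a predicted module by nucleotide overlap and by the set of motifs it reports. The abstract does not state which statistics the benchmark uses, so the choice of sensitivity and positive predictive value, the half-open interval convention, and the function names `nucleotide_level_scores` and `motif_level_scores` are all assumptions made for this example.

```python
def to_positions(intervals):
    """Expand half-open (start, end) intervals into a set of nucleotide positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def _rates(tp, fp, fn):
    """Sensitivity and positive predictive value, with 0.0 for empty denominators."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "sensitivity": sensitivity, "ppv": ppv}


def nucleotide_level_scores(annotated, predicted):
    """Compare predicted module locations with annotated ones, position by position.

    Both arguments are lists of half-open (start, end) intervals on one sequence.
    Hypothetical helper: the metrics are an assumption, not the paper's definition.
    """
    true_pos = to_positions(annotated)
    pred_pos = to_positions(predicted)
    tp = len(true_pos & pred_pos)   # positions correctly predicted as module
    fp = len(pred_pos - true_pos)   # predicted positions outside any annotated module
    fn = len(true_pos - pred_pos)   # annotated positions that were missed
    return _rates(tp, fp, fn)


def motif_level_scores(true_motifs, predicted_motifs):
    """Compare the set of motif (matrix) names reported for a module with the true set."""
    true_set, pred_set = set(true_motifs), set(predicted_motifs)
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)   # e.g. decoy matrices wrongly reported
    fn = len(true_set - pred_set)
    return _rates(tp, fp, fn)


if __name__ == "__main__":
    # Invented toy data: one annotated module and one partially overlapping prediction.
    print(nucleotide_level_scores([(100, 200)], [(150, 250)]))
    print(motif_level_scores({"TF_A", "TF_B"}, {"TF_A", "decoy_1", "decoy_2"}))
```

In this toy setup, a decoy matrix reported as part of a module counts as a motif-level false positive, which is one simple way to quantify the response to added noise.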
Although some of the methods tested tended to score somewhat better than others overall, there were still large variations between individual datasets, and no single method performed consistently better than the rest in all situations. The variation in performance on individual datasets also shows that the new benchmark datasets represent a suitable variety of challenges to most methods for module discovery.