Down Thomas A, Hubbard Tim J P
Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK.
Nucleic Acids Res. 2005 Mar 10;33(5):1445-53. doi: 10.1093/nar/gki282. Print 2005.
NestedMICA is a new, scalable pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, the use of a newly developed inference strategy called Nested Sampling means NestedMICA is able to find optimal solutions without the need for a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range of scenarios, on synthetic data and a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than could be handled by the existing program. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs that were good predictors for all five abundant classes of annotated binding sites.