Jackson Edmund S, Fitzgerald William J
Signal Processing Laboratory, Department of Engineering, Cambridge University, UK.
Bioinformatics. 2007 Jun 1;23(11):1313-20. doi: 10.1093/bioinformatics/btm054. Epub 2007 Mar 25.
A significant and stubbornly intractable problem in genome sequence analysis has been the de novo identification of transcription factor binding sites in promoter regions. Although theoretically pleasing, probabilistic methods have faced difficulties due to model mismatch and the nature of the biological sequence. These problems result in inference in a high dimensional, highly multimodal space, and consequently often display only local convergence and hence unsatisfactory performance.
In this article, we derive and demonstrate a novel method utilizing a sequential Monte Carlo-based expectation-maximization (EM) optimization to improve performance in this scenario. The Monte Carlo element should increase the robustness of the algorithm compared to classical EM. Furthermore, the parallel nature of the sequential Monte Carlo algorithm should be more robust than Gibbs sampling approaches to multimodality problems.
We demonstrate the superior performance of this algorithm on both semi-synthetic and real data from Escherichia coli.
http://sigproc-eng.cam.ac.uk/ approximately ej230/smc_em_tfbsid.tar.gz
基因组序列分析中一个重大且顽固棘手的问题一直是在启动子区域从头鉴定转录因子结合位点。尽管从理论上讲很有吸引力,但概率方法由于模型不匹配和生物序列的性质而面临困难。这些问题导致在高维、高度多模态空间中进行推断,因此常常仅表现出局部收敛,从而性能不尽人意。
在本文中,我们推导并展示了一种新颖的方法,该方法利用基于序贯蒙特卡罗的期望最大化(EM)优化来提升此场景下的性能。与经典EM相比,蒙特卡罗元素应能提高算法的稳健性。此外,序贯蒙特卡罗算法的并行特性对于多模态问题应比吉布斯采样方法更稳健。
我们在来自大肠杆菌的半合成数据和真实数据上都证明了该算法的卓越性能。
http://sigproc-eng.cam.ac.uk/ approximately ej230/smc_em_tfbsid.tar.gz