Xing Eric P, Wu Wei, Jordan Michael I, Karp Richard M
Computer Science Division, University of California, Berkeley, 94720, USA.
Proc IEEE Comput Soc Bioinform Conf. 2003;2:266-76.
The complexity of the global organization and internal structure of motifs in higher eukaryotic organisms poses significant challenges for motif detection techniques. Successful de novo motif detection requires modeling the complex dependencies within and among motifs and incorporating biological prior knowledge. In this paper, we present LOGOS, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending, and computing expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two interacting submodels: HMDM (hidden Markov Dirichlet-multinomial), a local alignment model that captures biological prior knowledge and positional dependence within the local structure of a motif; and HMM (hidden Markov model), a global motif distribution model that captures the frequencies and dependencies of motif occurrences. Model parameters can be fit using training motifs within an empirical Bayesian framework. A variational EM algorithm is developed for de novo motif detection. LOGOS improves on existing models that ignore biological priors and dependencies in motif structures and motif occurrences, and demonstrates superior performance on both semi-realistic test data and cis-regulatory sequences from yeast and Drosophila with regard to sensitivity, specificity, flexibility, and extensibility.
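The two-submodel structure described in the abstract can be illustrated with a small generative sketch. This is not the authors' implementation: it is a heavily simplified toy that assumes a two-type prior (conserved versus diffuse) chosen along the motif by a sticky Markov chain for the local part, and a single "start a motif here" probability for the global part. All function names, parameter values, and the DNA alphabet are illustrative assumptions, and no inference (variational EM) is attempted.

```python
import random

ALPHABET = "ACGT"


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_prior_path(width, stay_prob, rng):
    """Local part (assumed form): a Markov chain over prior types along the motif.
    0 = conserved-position prior, 1 = diffuse prior."""
    state = rng.randrange(2)
    path = [state]
    for _ in range(width - 1):
        if rng.random() >= stay_prob:
            state = 1 - state  # switch prior type with probability 1 - stay_prob
        path.append(state)
    return path


def sample_pwm(prior_path, rng):
    """Draw one multinomial per motif position from the prior its hidden state selects."""
    pwm = []
    for state in prior_path:
        if state == 0:
            # Conserved: concentrate mass on one randomly chosen base (toy assumption).
            alpha = [0.2] * len(ALPHABET)
            alpha[rng.randrange(len(ALPHABET))] = 8.0
        else:
            # Diffuse: flat Dirichlet.
            alpha = [1.0] * len(ALPHABET)
        pwm.append(sample_dirichlet(alpha, rng))
    return pwm


def sample_sequence(length, pwm, motif_start_prob, background, rng):
    """Global part (assumed form): at each background position, either start a motif
    instance or emit a single background base."""
    seq, starts = [], []
    while len(seq) < length:
        fits = len(seq) + len(pwm) <= length
        if fits and rng.random() < motif_start_prob:
            starts.append(len(seq))
            for column in pwm:
                seq.append(rng.choices(ALPHABET, weights=column)[0])
        else:
            seq.append(rng.choices(ALPHABET, weights=background)[0])
    return "".join(seq), starts


rng = random.Random(0)
prior_path = sample_prior_path(width=6, stay_prob=0.8, rng=rng)
pwm = sample_pwm(prior_path, rng)
seq, starts = sample_sequence(60, pwm, motif_start_prob=0.05,
                              background=[0.25] * 4, rng=rng)
print(prior_path, starts, seq)
```

The sticky chain makes neighboring motif positions tend to share a prior type, which is one simple way to express positional dependence within a motif; the global loop is the degenerate single-motif case of a hidden Markov model over motif occurrences.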