Yeo Gene, Burge Christopher B
Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Building 68-223, Cambridge, MA 02139, USA.
J Comput Biol. 2004;11(2-3):377-94. doi: 10.1089/1066527041410418.
We propose a framework for modeling sequence motifs based on the maximum entropy principle (MEP). We recommend approximating short sequence motif distributions with the maximum entropy distribution (MED) consistent with low-order marginal constraints estimated from available data, which may include dependencies between nonadjacent as well as adjacent positions. Many maximum entropy models (MEMs) can be specified by simply changing the set of constraints. Such models can be used to discriminate between signals and decoys. Classification performance using different MEMs gives insight into the relative importance of dependencies between different positions. We apply our framework to large datasets of RNA splicing signals. Our best models outperform previous probabilistic models in the discrimination of human 5' (donor) and 3' (acceptor) splice sites from decoys. Finally, we discuss mechanistically motivated ways of comparing models.
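The idea of a maximum entropy distribution constrained by low-order marginals can be illustrated with a minimal sketch. The abstract does not describe the fitting procedure, so this sketch substitutes iterative proportional fitting, a standard technique that, started from the uniform distribution, converges to the maximum entropy distribution matching a consistent set of marginals. All function names, the `smooth` parameter, and the toy sequences are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import itertools
import math

ALPHABET = "ACGT"


def target_marginals(seqs, position_sets, smooth=0.01):
    """Smoothed empirical frequencies for each constrained tuple of positions.

    Mixing with the uniform distribution (weight `smooth`) is an assumption of
    this sketch: it avoids zero probabilities while keeping all marginals
    mutually consistent. Sequences must contain only A, C, G, T.
    """
    targets = {}
    for pos in position_sets:
        keys = list(itertools.product(ALPHABET, repeat=len(pos)))
        counts = dict.fromkeys(keys, 0)
        for s in seqs:
            counts[tuple(s[i] for i in pos)] += 1
        targets[pos] = {
            k: (1 - smooth) * c / len(seqs) + smooth / len(keys)
            for k, c in counts.items()
        }
    return targets


def model_marginal(p, pos):
    """Marginal of the full distribution p over the positions in pos."""
    m = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in pos)
        m[key] = m.get(key, 0.0) + ps
    return m


def fit_max_entropy(length, targets, n_iter=100):
    """Iterative proportional fitting over all 4**length sequences.

    Starts from the uniform distribution and rescales it to match each target
    marginal in turn. Only practical for short motifs.
    """
    space = ["".join(t) for t in itertools.product(ALPHABET, repeat=length)]
    p = {s: 1.0 / len(space) for s in space}
    for _ in range(n_iter):
        for pos, target in targets.items():
            current = model_marginal(p, pos)
            for s in space:
                key = tuple(s[i] for i in pos)
                p[s] *= target[key] / current[key]
    return p


def log_odds(seq, p_signal, p_decoy):
    """Log2 ratio of signal-model to decoy-model probability."""
    return math.log2(p_signal[seq] / p_decoy[seq])


# Made-up toy data, not real splice sites.
signal = ["GTA", "GTA", "GTG", "GTA", "GCA"]
decoys = ["ACT", "CCA", "TTG", "GAC", "AGT"]

# One adjacent pair (0, 1) and one nonadjacent pair (0, 2).
constraints = [(0, 1), (0, 2)]

p_sig = fit_max_entropy(3, target_marginals(signal, constraints))
p_dec = fit_max_entropy(3, target_marginals(decoys, constraints))
score = log_odds("GTA", p_sig, p_dec)
```

Changing the `constraints` list is all it takes to define a different model, which mirrors the abstract's point that many models arise from different constraint sets. Comparing how well the resulting scores separate signals from decoys then indicates which position dependencies matter.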