Kim Jan T, Gewehr Jan E, Martinetz Thomas
J Bioinform Comput Biol. 2004 Jun;2(2):289-307. doi: 10.1142/s0219720004000569.
Recognition of protein-DNA binding sites in genomic sequences is a crucial step in discovering the biological functions of those sequences. The explosive growth in the availability of sequence information has created a demand for binding site detection methods with high specificity. The work presented here addresses this demand with a systematic approach based on maximum likelihood estimation. A general framework is developed in which a large class of binding site detection methods can be described in a uniform and consistent way. Protein-DNA binding is determined by binding energy, which is an approximately linear function on the space of sequence words. All matrix-based binding word detectors can therefore be regarded as different linear classifiers, each attempting to estimate the linear separation implied by the binding energy function. The standard approaches of consensus sequences and profile matrices are described using this framework. A maximum likelihood approach to determining this linear separation leads to a novel matrix type, called the binding matrix. The binding matrix is the most specific matrix-based classifier that is consistent with the input set of known binding words. It achieves significant improvements in specificity compared to other matrices. This is demonstrated using 95 sets of experimentally determined binding words provided by the TRANSFAC database.
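The view of a matrix-based detector as a linear classifier can be illustrated with a small sketch. The code below is not the binding matrix of the paper and is not taken from it; it builds an ordinary log-odds profile matrix, one of the standard approaches the abstract mentions, and writes its score explicitly as a dot product with a one-hot encoding of the word. The function names, the pseudocount, the uniform background of 0.25, the rule of placing the threshold at the lowest score among the known words, and the toy binding words are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"  # assumption: words contain only these four bases


def one_hot(word):
    """Flatten a DNA word into a 0/1 vector of length 4 * len(word)."""
    return [1.0 if base == letter else 0.0
            for base in word for letter in ALPHABET]


def profile_matrix(words, pseudocount=1.0, background=0.25):
    """Log-odds profile matrix: one row per position, one column per base."""
    length = len(words[0])
    if any(len(w) != length for w in words):
        raise ValueError("all binding words must have the same length")
    matrix = []
    for pos in range(length):
        counts = {b: pseudocount for b in ALPHABET}
        for w in words:
            counts[w[pos]] += 1.0
        total = sum(counts.values())
        matrix.append([math.log(counts[b] / total / background)
                       for b in ALPHABET])
    return matrix


def score(matrix, word):
    """Matrix score written as a dot product, i.e. a linear function."""
    weights = [w for row in matrix for w in row]
    return sum(w * x for w, x in zip(weights, one_hot(word)))


def make_detector(words, **kwargs):
    """Linear classifier that accepts every known binding word.

    Hypothetical threshold rule: the lowest score among the known words.
    """
    matrix = profile_matrix(words, **kwargs)
    threshold = min(score(matrix, w) for w in words)

    def detect(word):
        return len(word) == len(matrix) and score(matrix, word) >= threshold

    return detect


if __name__ == "__main__":
    known = ["TATAAT", "TATGAT", "TACAAT", "TATAAA"]  # toy data, not TRANSFAC
    detect = make_detector(known)
    for candidate in known + ["GGGCCC"]:
        print(candidate, detect(candidate))
```

In this picture, different matrix types correspond to different weight vectors and thresholds over the same one-hot space. According to the abstract, the binding matrix is the choice that is most specific while remaining consistent with the known binding words; the threshold rule above only guarantees the consistency part.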