Chin Francis, Leung Henry C M
Department of Computer Science, The University of Hong Kong, Hong Kong.
IEEE/ACM Trans Comput Biol Bioinform. 2008 Jan-Mar;5(1):110-9. doi: 10.1109/TCBB.2007.70220.
The problem of discovering novel motifs of binding sites is important to the understanding of gene regulatory networks. Motifs are generally represented by matrices (position weight matrix (PWM) or position specific scoring matrix (PSSM) or strings. However, these representations cannot model biological binding sites well because they fail to capture nucleotide interdependence. It has been pointed out by many researchers that the nucleotides of the DNA binding site cannot be treated independently, e.g. the binding sites of zinc finger in proteins. In this paper, a new representation called Scored Position Specific Pattern (SPSP), which is a generalization of the matrix and string representations, is introduced which takes into consideration the dependent occurrences of neighboring nucleotides. Even though the problem of discovering the optimal motif in SPSP representation is proved to be NP-hard, we introduce a heuristic algorithm called SPSP-Finder, which can effectively find optimal motifs in most simulated cases and some real cases for which existing popular motif finding software, such as Weeder, MEME and AlignACE, fail.
发现结合位点的新基序问题对于理解基因调控网络至关重要。基序通常由矩阵(位置权重矩阵(PWM)或位置特异性评分矩阵(PSSM))或字符串表示。然而,这些表示不能很好地对生物结合位点进行建模,因为它们无法捕捉核苷酸的相互依赖性。许多研究人员指出,DNA结合位点的核苷酸不能被独立对待,例如蛋白质中锌指的结合位点。在本文中,引入了一种称为评分位置特异性模式(SPSP)的新表示,它是矩阵和字符串表示的推广,考虑了相邻核苷酸的相关出现情况。尽管在SPSP表示中发现最优基序的问题被证明是NP难的,但我们引入了一种称为SPSP-Finder的启发式算法,它可以在大多数模拟案例以及一些现有流行基序发现软件(如Weeder, MEME和AlignACE)失败的实际案例中有效地找到最优基序。