Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria 3086, Australia.
BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S16. doi: 10.1186/1471-2105-12-S1-S16.
Discrimination of transcription factor binding sites (TFBS) from background sequences plays a key role in computational motif discovery. Current clustering-based algorithms employ a homogeneous model, which assumes that motifs and background signals can be characterized equivalently. This assumption is limiting because the two kinds of sequence signal have distinct properties.
This paper develops a Self-Organizing Map (SOM) based clustering algorithm for extracting binding sites from DNA sequences. Our framework is built on a novel intra-node soft competitive procedure that aims to maximize discrimination of motifs from background signals in the datasets. The intra-node competition uses an adaptive weighting technique over two different signal models to better represent the two classes of signals. Using several real and artificial datasets, we compared our proposed method with several motif discovery tools. Compared with SOMBRERO, a state-of-the-art SOM-based motif discovery tool, our algorithm achieves a significant improvement in average precision (about 27%) on the real datasets without compromising sensitivity. Our method also performed favourably against the other motif discovery tools.
Motif discovery with a model-based clustering framework should consider using a heterogeneous model to represent the two classes of signals in DNA sequences. Such a heterogeneous model can achieve better signal discrimination than a homogeneous model.
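To make the idea of a heterogeneous, two-model node more concrete, the following is a minimal illustrative sketch and not the algorithm from the paper. It assumes a position weight matrix as the motif model, a zero-order base-frequency model as the background model, and a simple likelihood-ratio "responsibility" as the soft competition between them. The SOM grid, neighbourhood updates and the paper's actual weighting rule are omitted. All function names and parameters (`build_pwm`, `motif_responsibility`, `update_motif_weight`, `learning_rate`) are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(kmers, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length k-mers."""
    length = len(kmers[0])
    total = len(kmers) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(kmer[pos] for kmer in kmers)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

def pwm_likelihood(kmer, pwm):
    """Likelihood of a k-mer under the (assumed) motif model."""
    return math.prod(column[base] for base, column in zip(kmer, pwm))

def background_likelihood(kmer, base_freqs):
    """Likelihood of a k-mer under the (assumed) zero-order background model."""
    return math.prod(base_freqs[base] for base in kmer)

def motif_responsibility(kmer, pwm, base_freqs, motif_weight):
    """Soft competition: share of the weighted likelihood claimed by the motif model."""
    motif_term = motif_weight * pwm_likelihood(kmer, pwm)
    background_term = (1.0 - motif_weight) * background_likelihood(kmer, base_freqs)
    return motif_term / (motif_term + background_term)

def update_motif_weight(kmers, pwm, base_freqs, motif_weight, learning_rate=0.5):
    """Move the node's motif weight toward the mean responsibility of its k-mers."""
    mean_resp = sum(
        motif_responsibility(k, pwm, base_freqs, motif_weight) for k in kmers
    ) / len(kmers)
    return (1.0 - learning_rate) * motif_weight + learning_rate * mean_resp

# Toy usage: k-mers assigned to one node, scored under both models.
node_kmers = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(node_kmers)
uniform_background = {b: 0.25 for b in BASES}
site_score = motif_responsibility("TATAAT", pwm, uniform_background, 0.5)
noise_score = motif_responsibility("GCGCGC", pwm, uniform_background, 0.5)
new_weight = update_motif_weight(node_kmers, pwm, uniform_background, 0.5)
```

In this toy setting a node whose k-mers resemble a conserved site drifts toward the motif model, while a node filled with unrelated k-mers would drift toward the background model. This is one simple way to read "adaptive weighting on two different signal models".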