Olman Victor, Xu Dong, Xu Ying
Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6480, USA.
J Bioinform Comput Biol. 2003 Apr;1(1):21-40. doi: 10.1142/s0219720003000162.
Transcription factor binding sites are short fragments in the upstream regions of genes to which transcription factors bind to regulate the transcription of genes into mRNA. Computational identification of these binding sites remains an unsolved and challenging problem, despite the considerable effort devoted to it. We have recently developed a novel technique for identifying binding sites from a set of upstream regions of genes that may be transcriptionally co-regulated and hence may share similar transcription factor binding sites. By exploiting two key features of such binding sites (their high sequence similarity to one another and their relatively high frequency compared with other sequence fragments), we have formulated this problem as a cluster identification problem, that is, the problem of identifying and extracting data clusters from a noisy background. While the classical data clustering problem (partitioning a data set into clusters sharing common or similar features) has been extensively studied, there is no general algorithm for identifying data clusters in a noisy background. In this paper, we present a novel algorithm for solving this problem. We have proved that a cluster identification problem, under our definition, can be solved rigorously and efficiently by searching for substrings with special properties in a linear sequence. We have also developed a method for assessing the statistical significance of each identified cluster, which can be used to rule out accidental data clusters. We have implemented the cluster identification algorithm and the statistical significance analysis method in a software program, CUBIC, and tested it extensively. We present here a few applications of CUBIC to challenging cases of binding site identification.
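The idea of reducing cluster identification to a substring search over a linear sequence can be illustrated with a minimal sketch. The abstract does not specify the construction, so everything here is an assumption made for illustration: candidate fragments are ordered by Prim's minimum-spanning-tree insertion order under Hamming distance, and a cluster is taken to be a maximal run of small attachment weights in that order. The function names (`hamming`, `prim_order`, `find_clusters`) and the parameters `max_w` and `min_size` are hypothetical, and no statistical significance test is attempted.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length fragments."""
    return sum(x != y for x, y in zip(a, b))


def prim_order(points, dist):
    """Order points by Prim's MST insertion (assumed construction).

    Returns (order, weights): order[i] is the index of the i-th point added,
    weights[i] is the weight of the edge that attached it to the tree.
    """
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n
    best[0] = 0  # start from the first point
    order, weights = [], []
    for _ in range(n):
        # Pick the cheapest point not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        order.append(u)
        weights.append(best[u])
        # Relax the attachment cost of the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return order, weights


def find_clusters(points, max_w, min_size, dist=hamming):
    """Scan the linear sequence of weights for runs with weight <= max_w.

    Each maximal run, together with the point just before it, is reported
    as a candidate cluster if it has at least min_size members.
    """
    order, weights = prim_order(points, dist)
    clusters, i, n = [], 1, len(order)
    while i < n:
        if weights[i] <= max_w:
            j = i
            while j < n and weights[j] <= max_w:
                j += 1
            members = order[i - 1:j]
            if len(members) >= min_size:
                clusters.append([points[m] for m in members])
            i = j
        else:
            i += 1
    return clusters


# Toy input: three near-identical fragments hidden among unrelated ones.
fragments = ["CCCGGG", "TTGACA", "AGAGAG", "TTGACT", "GTCCAT", "TTGATA"]
clusters = find_clusters(fragments, max_w=1, min_size=3)
```

In this toy input, the three `TTGA..` fragments differ from one another by one or two mismatches, while every unrelated fragment is at least four mismatches from anything else, so the similar fragments land next to each other in the ordering and show up as a single low-weight run. A real application would need a principled choice of thresholds and a significance assessment, which this sketch leaves out.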