Olman Victor, Xu Dong, Xu Ying
Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6480, USA.
Pac Symp Biocomput. 2003:327-38.
Recognition of protein-binding sites in the upstream regions of genes is an important, unsolved problem. In this paper, we present a new approach to it. We formulate binding-site recognition as a cluster identification problem: identifying clusters in a data set whose features (e.g., density) differ significantly from the overall background of the data set. We have developed a general framework for solving such cluster identification problems. The foundation of the framework is a rigorous relationship between data clusters and subtrees of a minimum spanning tree (MST) representation of the data set. We propose a formal and general definition of a cluster, and demonstrate that a cluster is always represented as a connected component of the MST and, further, that it corresponds to a substring of a linear representation of the MST. Cluster identification is thereby reduced to finding substrings with certain features, for which algorithms have been developed. We have applied this MST-based cluster identification algorithm to a number of binding-site identification problems, with highly encouraging results.
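The reduction from clusters to substrings can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the linear representation is taken to be the order in which Prim's algorithm adds points to the MST, and a "cluster" is simplified to a run of consecutive points whose attaching edges are all short. The function names (`prim_order`, `dense_substrings`), the `threshold` and `min_size` parameters, and the Euclidean default distance are hypothetical choices for illustration, not the authors' definitions.

```python
import math


def prim_order(points, dist=math.dist):
    """Run Prim's algorithm on the complete graph over `points`.

    Returns (order, weights): order[i] is the index of the i-th point added
    to the tree, and weights[i] is the length of the MST edge that attached
    it (weights[0] is 0.0 for the starting point).
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0
    order, weights = [], []
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        order.append(u)
        weights.append(best[u])
        # Relax the edges from the newly added point.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return order, weights


def dense_substrings(order, weights, threshold, min_size):
    """Return runs of the Prim order whose attaching edges are all <= threshold.

    This is an assumed, simplified cluster criterion: a long attaching edge
    ends the current run and starts a new one.
    """
    clusters, run = [], [order[0]]
    for node, w in zip(order[1:], weights[1:]):
        if w <= threshold:
            run.append(node)
        else:
            if len(run) >= min_size:
                clusters.append(run)
            run = [node]
    if len(run) >= min_size:
        clusters.append(run)
    return clusters


if __name__ == "__main__":
    # Two tight groups plus one isolated background point (made-up data).
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 30)]
    order, weights = prim_order(pts)
    print(dense_substrings(order, weights, threshold=2.0, min_size=3))
```

Because Prim's algorithm always takes the shortest edge leaving the current tree, it exhausts a well-separated dense group before crossing a long edge out of it, which is why such a group shows up as a contiguous run in the insertion order. For sequence data, a different `dist` (for example a Hamming distance between equal-length k-mers) could be passed in place of the Euclidean default.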