Discriminative discovery of transcription factor binding sites from location data.

Author information

Kawada Yuji, Sakakibara Yasubumi

Affiliation

Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, 223-8522, Japan.

Publication information

Proc IEEE Comput Syst Bioinform Conf. 2005:86-9. doi: 10.1109/csb.2005.30.

PMID:16447966

Abstract

MOTIVATION

The availability of genome-wide location analyses based on chromatin immunoprecipitation (ChIP) data provides new insight into the in silico analysis of transcriptional regulation.

RESULTS

We propose a novel discriminative discovery framework for precisely identifying transcriptional regulatory motifs from both positive and negative samples, that is, the sets of upstream sequences of genes that are bound and not bound by a transcription factor (TF) according to genome-wide location data. In this framework, our goal is to find the discriminative motifs that best explain the location data, in the sense that the motifs precisely discriminate the positive samples from the negative ones. First, to discover an initial set of substrings that discriminate between positive and negative samples, we apply a decision tree learning method that produces a text-classification tree. We then extract several clusters of similar substrings from the internal nodes of the learned tree. Second, we construct an initial profile-HMM from each cluster to represent a putative motif, and iteratively refine the profile-HMMs to improve their discrimination accuracy. Our genome-wide experimental results on yeast show that our method successfully identifies the consensus sequences reported in the literature for known TFs. It also discriminates well between positive and negative samples for all the TFs, whereas most other motif detection methods perform very poorly on this discrimination problem. Our learned profile-HMMs also improve the false negative predictions of ChIP data.
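
As a rough illustration of the first step, the sketch below picks the single substring whose presence or absence best separates positive from negative sequences, which is the kind of test an internal node of a text-classification tree could hold. It is a minimal sketch under stated assumptions, not the authors' implementation: the information-gain criterion, the fixed substring length `k`, the function names (`entropy`, `kmers`, `best_split`), and the toy sequences are all hypothetical choices made for this example.

```python
from math import log2


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` positive and `neg` negative items."""
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_split(positives, negatives, k):
    """Return (substring, gain) for the presence/absence test with the highest
    information gain. Assumption: a fixed length k; a real tree learner would
    search more broadly and recurse on each branch."""
    total = len(positives) + len(negatives)
    base = entropy(len(positives), len(negatives))
    candidates = set()
    for seq in positives + negatives:
        candidates |= kmers(seq, k)

    best_word, best_gain = None, 0.0
    for word in sorted(candidates):  # sorted for deterministic tie-breaking
        p_in = sum(word in s for s in positives)
        n_in = sum(word in s for s in negatives)
        p_out = len(positives) - p_in
        n_out = len(negatives) - n_in
        # Weighted entropy of the two branches after testing for `word`.
        remainder = ((p_in + n_in) * entropy(p_in, n_in)
                     + (p_out + n_out) * entropy(p_out, n_out)) / total
        gain = base - remainder
        if gain > best_gain:
            best_word, best_gain = word, gain
    return best_word, best_gain


# Toy data invented for this example (not from the paper).
positives = ["ACGTGACTCATT", "TTGACTCAGG", "CCTGACTCAA"]
negatives = ["ACGTACGTACGT", "TTTTCCCCGGGG", "GATTACAGATTA"]
word, gain = best_split(positives, negatives, k=7)
print(word, gain)
```

On this toy input the selected substring occurs in every positive sequence and in no negative one, so the split is perfect. In the framework described above, substrings collected from such nodes would then be clustered and used to seed the profile-HMMs; that refinement step is not shown here.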
