Chen Chien-Yu, Tsai Huai-Kuang, Hsu Chen-Ming, May Chen Mei-Ju, Hung Hao-Geng, Huang Grace Tzu-Wei, Li Wen-Hsiung
Department of Bio-Industrial Mechatronics Engineering, Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 106, Taiwan.
Proc Natl Acad Sci U S A. 2008 Feb 19;105(7):2527-32. doi: 10.1073/pnas.0712188105. Epub 2008 Feb 13.
A gapped transcription factor-binding site (TFBS) contains one or more highly degenerate positions. Discovering gapped motifs is difficult because allowing highly degenerate positions in a motif greatly enlarges the search space and complicates the discovery process. Here, we propose a method for discovering TFBSs, especially gapped motifs. We use ChIP-chip data to judge the binding strength of a TF to a putative target promoter and use orthologous sequences from related species to judge the degree of evolutionary conservation of a predicted TFBS. Candidate motifs are constructed by growing compact motif blocks and by concatenating two candidate blocks, allowing 0-15 degenerate positions between them. The resulting patterns are statistically evaluated for their ability to distinguish between target and nontarget genes. A position-based ranking procedure is then applied to enhance the signals of true motifs by collecting position concurrences. Empirical tests on 32 known yeast TFBSs show that the method is highly accurate in identifying gapped motifs, outperforming current methods, and that it also works well on ungapped motifs. Predictions for an additional 54 TFs discover 11 gapped and 38 ungapped motifs that are supported by the literature. Our method achieves high sensitivity and specificity for predicting experimentally verified TFBSs.
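The block-concatenation and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`gapped_patterns`, `hypergeom_tail`, `scan_gaps`), the hypergeometric enrichment test, and the toy sequences are all assumptions made for illustration, and the sketch ignores ChIP-chip binding strength, evolutionary conservation, the reverse strand, and the position-based ranking step. The blocks CGG and CCG are chosen to echo the well-known Gal4 site CGG-N11-CCG.

```python
import re
from math import comb

def gapped_patterns(block_a, block_b, max_gap=15):
    """Yield (gap, regex) for block_a + `gap` fully degenerate positions + block_b."""
    for gap in range(max_gap + 1):
        yield gap, re.compile(f"{block_a}[ACGT]{{{gap}}}{block_b}")

def hypergeom_tail(k, K, n, N):
    """P(X >= k) when drawing n genes from N, of which K contain the pattern."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def scan_gaps(block_a, block_b, targets, nontargets, max_gap=15):
    """Score each gap length by enrichment of pattern hits among target promoters.

    The hypergeometric test is an assumed stand-in for the statistical
    evaluation mentioned in the abstract, which does not name its test.
    """
    results = []
    n, N = len(targets), len(targets) + len(nontargets)
    for gap, pattern in gapped_patterns(block_a, block_b, max_gap):
        k = sum(1 for seq in targets if pattern.search(seq))           # hits in targets
        K = k + sum(1 for seq in nontargets if pattern.search(seq))    # hits overall
        results.append((gap, k, K, hypergeom_tail(k, K, n, N)))
    return results

# Toy promoter sequences (made up for this example, not real yeast data).
targets = ["AACGG" + "ATATATATATA" + "CCGTT", "CGG" + "GCGCGCGCGCG" + "CCG"]
nontargets = ["ATATATATATATATATATATAT", "CGGATCCG", "GGGGGGGGGGGG"]

results = scan_gaps("CGG", "CCG", targets, nontargets)
best = min(results, key=lambda r: r[3])   # gap length with the smallest p-value
print(best)
```

On this toy input the lowest p-value falls at a gap of 11, since both toy targets and no toy nontargets contain CGG-N11-CCG. A real analysis would need many more promoters for the p-values to be meaningful, and would have to account for testing many candidate patterns.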