Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany.
PLoS One. 2012;7(11):e49086. doi: 10.1371/journal.pone.0049086. Epub 2012 Nov 28.
Knowing the mapping between transcription factors (TFs) and their binding sites is essential for reverse engineering gene regulation. Only about 10%-20% of transcription factor binding motifs (TFBMs) have been reported, and this lack of data hinders the understanding of gene regulation. To address this gap, we propose a computational method that exploits previously unused TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory "DNA words." From this dictionary, it distills 4098 novel predictions. To reveal the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns, creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed the predictions down to a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs, confirming that our system can reliably predict motifs ab initio with an accuracy of 81%, far higher than previous approaches. We found that, on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that supplementary regulatory mechanisms are required to control smaller gene sets independently. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the available targets in the genome, TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of "DNA words," the newly predicted motifs, and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter for discovering TFBMs that play a major role in orchestrating other factors and are thus likely to lock/unlock cellular functional clusters.
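As an illustration of what a combinatorial binding pattern and its target genes could look like in practice, the following minimal Python sketch groups genes by the motif combinations found in their promoters and reports the fraction of combinations that reach a minimum number of target genes. It is not the published algorithm: the function names, the toy gene and motif identifiers, and the choice of simple co-occurrence as the definition of a pattern are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def combinatorial_patterns(promoter_motifs, size=2):
    """Map each motif combination of the given size to the set of genes
    whose promoter contains every motif in that combination.

    promoter_motifs: dict of gene name -> set of motif identifiers
    (hypothetical input format, assumed for this sketch).
    """
    targets = defaultdict(set)
    for gene, motifs in promoter_motifs.items():
        # sorted() gives each combination one canonical ordering
        for combo in combinations(sorted(motifs), size):
            targets[combo].add(gene)
    return dict(targets)


def fraction_with_min_targets(targets, min_genes):
    """Fraction of patterns that target at least min_genes genes."""
    if not targets:
        return 0.0
    hits = sum(1 for genes in targets.values() if len(genes) >= min_genes)
    return hits / len(targets)


# Toy data: three hypothetical promoters and the motifs found in each.
promoter_motifs = {
    "geneA": {"M1", "M2", "M3"},
    "geneB": {"M1", "M2"},
    "geneC": {"M2", "M3"},
}

patterns = combinatorial_patterns(promoter_motifs)
share = fraction_with_min_targets(patterns, min_genes=2)
print(patterns)
print(share)
```

With real data, the same kind of summary with `min_genes=10` would correspond to the statistic quoted in the abstract, although the published method may define and score patterns differently.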