顺式调控元件的预测：从高信息含量分析到基序识别

Li Guojun, Lu Jizhu, Olman Victor, Xu Ying

School of Mathematics and System Sciences, Shandong University, Jinan 250100, China.

J Bioinform Comput Biol. 2007 Aug;5(4):817-38. doi: 10.1142/s021972000700293x.

One popular approach to prediction of binding motifs of transcription factors is to model the problem as to search for a group of l-mers (motifs), for some l > 0, one from each of the provided promoter regions of a group of co-expressed genes, that exhibit high information content when aligned without gaps. In our current work, we assume that these desired l-mers have evolved from a common ancestor, each of which has mutations in at most k-positions from the common ancestor, where k is substantially smaller than l. This implies that these l-mers should belong to the k-neighborhood of their common ancestor, measured in terms of Hamming distance. If the ancestor is given, then the problem for finding these l-mers becomes trivial. Unfortunately, the problem of identifying the unknown ancestor is probably as hard as the problem of predicting the motifs themselves. Our goal is to identify a set of l-mers that slightly violate the k-neighborhood of a putative ancestor, but capture all the desired motifs, which will lead to an efficient way for identification of the desired motifs. The main contributions of this paper are in four aspects: (a) we have derived nontrivial lower and upper bounds of information content for a set of l-mers that differ from an unknown ancestor in no more than k positions; (b) we have defined a new distance between two sequences and a k-pseudo-neighborhood, based on the new distance, that contains the k-neighborhood, defined by Hamming distance, of the to-be-defined ancestor; (c) we have developed an algorithm to minimize the sum of all the distances between a predicted ancestor motif and a group of l-mers from the provided promoter regions, using the new distance; and (d) we have tested PROMOCO and compared its prediction results performance with two other prediction programs. The algorithm, implemented as a computer software program PROMOCO, has been used to find all conserved motifs in a set of provided promoter sequences. Our preliminary application of PROMOCO shows that it achieves better or comparable prediction results, when compared to popular programs for identification of cis regulatory binding motifs. A limitation of the algorithm is that it does not work well when the size of the set of provided promoter sequences is too small or when desired motifs appear in only small portion of the given sequences.

预测转录因子结合基序的一种流行方法是将该问题建模为搜索一组 l 聚体（基序），对于某个 l > 0，从一组共表达基因的每个提供的启动子区域中选取一个，这些 l 聚体在无间隙对齐时表现出高信息含量。在我们当前的工作中，我们假设这些期望的 l 聚体从一个共同祖先进化而来，其中每个 l 聚体与共同祖先相比最多有 k 个位置发生突变，这里 k 远小于 l。这意味着这些 l 聚体应该属于其共同祖先的 k 邻域，以汉明距离来衡量。如果祖先已知，那么寻找这些 l 聚体的问题就变得微不足道。不幸的是，识别未知祖先的问题可能与预测基序本身的问题一样困难。我们的目标是识别一组 l 聚体，它们稍微偏离一个假定祖先的 k 邻域，但捕获所有期望的基序，这将导致一种识别期望基序的有效方法。本文的主要贡献在四个方面：(a) 我们推导出了一组与未知祖先在不超过 k 个位置不同的 l 聚体的信息含量的非平凡下限和上限；(b) 我们基于新距离定义了两个序列之间的新距离和一个 k 伪邻域，该 k 伪邻域包含待定义祖先的由汉明距离定义的 k 邻域；(c) 我们开发了一种算法，使用新距离来最小化预测的祖先基序与来自提供的启动子区域的一组 l 聚体之间的所有距离之和；(d) 我们测试了 PROMOCO 并将其预测结果性能与其他两个预测程序进行了比较。该算法作为计算机软件程序 PROMOCO 实现，已用于在一组提供的启动子序列中找到所有保守基序。我们对 PROMOCO 的初步应用表明，与用于识别顺式调控结合基序的流行程序相比，它能取得更好或相当的预测结果。该算法的一个局限性是，当提供的启动子序列集的大小太小或期望基序仅出现在给定序列的一小部分中时，它的效果不佳。