Hertzberg Libi, Izraeli Shai, Domany Eytan
Department of Pediatric Hemato-Oncology, The Sheba Cancer Research Center, Tel Hashomer, Israel.
Bioinformatics. 2007 Jul 15;23(14):1737-43. doi: 10.1093/bioinformatics/btm249. Epub 2007 May 8.
Existing computational methods that identify transcription factor (TF) binding sites on a gene's promoter are plagued by significant inaccuracies. Binding of a TF to a particular sequence is assessed by comparing its similarity score, obtained from the TF's known position weight matrix (PWM), to a threshold. If the similarity score is above the threshold, the sequence is considered a putative binding site. Determining this threshold is a central part of the problem, for which no satisfactory biologically based solution exists.
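To make the thresholding step concrete, here is a minimal sketch of PWM scoring against a fixed cutoff. It is not taken from STOP: the 4-position matrix, the uniform background frequency, the log-odds scoring scheme and the function names are all illustrative assumptions.

```python
import math

# Hypothetical 4-position PWM (illustrative values, not a real TF matrix).
# Each column gives the probability of each base at that position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def pwm_score(window):
    """Log-odds similarity score of one window whose length equals the PWM width."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, window))


def putative_sites(promoter, threshold):
    """Return (offset, window) pairs whose score is above the threshold."""
    width = len(PWM)
    sites = []
    for i in range(len(promoter) - width + 1):
        window = promoter[i:i + width]
        if pwm_score(window) > threshold:
            sites.append((i, window))
    return sites
```

Every window is scored the same way, so the list of putative sites depends entirely on the cutoff passed as `threshold`, which is the quantity the abstract identifies as hard to choose.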
We present a method that integrates gene expression data with sequence-based scoring of TF binding sites to determine a global score threshold for each TF. We validate our method, STOP (Searching TFs Of Promoters), in several ways: (1) We calculate the average expression values of groups of human putative target genes of each TF, and compare them to similar averages derived for random gene groups. The groups of putative targets show significantly higher relative average expression. (2) We find high consistency between the induced lists of putative targets in human and in mouse. (3) The expression patterns associated with genes ordered by PWM score for each TF are highly similar between human and mouse, indicating that our method has a firm biological basis. (4) Comparison of results obtained by STOP and PRIMA (Elkon et al., 2003) suggests that determining the score threshold using gene expression, as is done in STOP, is more biologically tuned.
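Validation (1) can be illustrated with a generic resampling sketch. The abstract does not state which statistical procedure was used, so the empirical p-value below, the function name `empirical_p_value` and its parameters are assumptions for illustration only.

```python
import random
from statistics import mean


def empirical_p_value(expression, targets, n_random=1000, seed=0):
    """Compare the mean expression of a target group with random groups of equal size.

    expression: dict mapping gene name -> expression value (hypothetical input format)
    targets:    list of gene names called putative targets of one TF
    Returns (observed_mean, empirical one-sided p-value).
    """
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    genes = sorted(expression)  # stable gene order for sampling
    observed = mean(expression[g] for g in targets)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(genes, len(targets))
        if mean(expression[g] for g in sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_random + 1)
```

A small p-value here means that few random gene groups reach the mean expression of the putative targets, which is the kind of evidence described in point (1).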
The software package will be available to academic users upon request.
Supplementary data are available on Bioinformatics online.