Pavesi Giulio, Zambelli Federico, Pesole Graziano
Dipartimento di Scienze Biomolecolari e Biotecnologie, University of Milan, Milan, Italy.
BMC Bioinformatics. 2007 Feb 7;8:46. doi: 10.1186/1471-2105-8-46.
This work addresses the problem of detecting conserved transcription factor binding sites, and regulatory regions in general, through the analysis of sequences from homologous genes, an approach that is becoming more widely used as the amount of available genomic data grows.
We present an algorithm that identifies conserved transcription factor binding sites in a given sequence by comparing it to one or more homologs, adapting a framework we previously introduced for discovering sites in sequences from co-regulated genes. Unlike the most commonly used methods, our approach neither requires nor computes an alignment of the sequences investigated, nor does it rely on descriptors of the binding specificity of known transcription factors. The main novel idea is a relative measure of conservation, which assumes that true functional elements should be more conserved than the sequence surrounding them. We present tests in which the algorithm was applied to the identification of conserved annotated sites in homologous promoters, as well as in distal regions such as enhancers.
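To make the idea of a relative, alignment-free measure of conservation concrete, the following is a minimal Python sketch. It is not the authors' implementation: the window length `k`, the flank size `flank`, the use of Hamming distance as the similarity measure, and the function names are all assumptions chosen for illustration. Each window of the reference sequence is scored by its best ungapped match anywhere in the homolog, and that score is then compared with the average score of the neighbouring windows.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match_score(kmer, homolog):
    """Similarity (0..1) of the best ungapped match of kmer anywhere in homolog.

    No alignment of the two sequences is computed: every window of the
    homolog is tried, wherever it lies.
    """
    k = len(kmer)
    best = min(hamming(kmer, homolog[j:j + k])
               for j in range(len(homolog) - k + 1))
    return 1.0 - best / k


def relative_conservation(reference, homolog, k=8, flank=20):
    """Return (raw, rel) conservation scores for every k-mer of reference.

    raw[i] is the best-match similarity of the window starting at i.
    rel[i] is raw[i] minus the mean raw score of the windows within
    `flank` positions on either side (hypothetical definition of the
    "relative" measure, assumed here for illustration only).
    """
    raw = [best_match_score(reference[i:i + k], homolog)
           for i in range(len(reference) - k + 1)]
    rel = []
    for i, score in enumerate(raw):
        lo, hi = max(0, i - flank), min(len(raw), i + flank + 1)
        neighbours = raw[lo:i] + raw[i + 1:hi]
        background = sum(neighbours) / len(neighbours) if neighbours else score
        rel.append(score - background)
    return raw, rel
```

Under these assumptions, a window that is well conserved in an otherwise divergent region receives a high relative score, while the same raw similarity inside a uniformly conserved region scores close to zero.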
The test results show that the algorithm provides fast and reliable predictions of conserved transcription factor binding sites regulating the transcription of a gene, with better performance than other available methods for the same task. We also show examples of how the algorithm can be successfully employed when promoter annotations of the genes investigated are missing, or when regulatory sites and regions are located far from the genes.