Grad Yonatan H, Roth Frederick P, Halfon Marc S, Church George M
The Lipper Center for Computational Genetics, Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts, 02115, USA.
Bioinformatics. 2004 Nov 1;20(16):2738-50. doi: 10.1093/bioinformatics/bth320. Epub 2004 May 14.
To date, computational searches for cis-regulatory modules (CRMs) have relied on two methods. The first, phylogenetic footprinting, has been used to find CRMs in non-coding sequence, but does not directly link DNA sequence with spatio-temporal patterns of expression. The second, based on searches for combinations of transcription factor (TF) binding motifs, has been employed in genome-wide discovery of similarly acting enhancers, but requires prior knowledge of the set of TFs acting at the CRM and the TFs' binding motifs.
We propose a method for CRM discovery that combines aspects of both approaches in an effort to overcome their individual limitations. By treating phylogenetically footprinted non-coding regions (PFRs) as proxies for CRMs, we endeavor to find PFRs near co-regulated genes that are composed of similar short, conserved sequences. Using Markov chains as a convenient formulation to assess similarity, we develop a sampling algorithm to search a large group of PFRs for the most similar subset. When starting with a set of genes involved in Drosophila early blastoderm development and using phylogenetic comparisons of Drosophila melanogaster and D. pseudoobscura genomes, we show here that our algorithm successfully detects known CRMs. Further, we use our similarity metric, based on Markov chain discrimination, in a genome-wide search, and uncover additional known and many candidate early blastoderm CRMs.
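As a rough illustration of what Markov chain discrimination can look like, the sketch below trains a fixed-order Markov chain on a set of "similar" sequences and another on background sequences, then scores a candidate region by its log-odds under the two models. This is not the published algorithm: the function names (`train_markov_chain`, `log_odds`), the chain order, the pseudocount, and the toy sequences are all assumptions made for illustration, and the sampling step that selects the most similar subset of PFRs is not shown.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def train_markov_chain(seqs, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from a list of sequences.

    Hypothetical helper: pseudocounts keep every transition probability
    non-zero so that log-odds scores stay finite.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    model = {}
    for ctx in ("".join(p) for p in product(ALPHABET, repeat=order)):
        total = sum(counts[ctx][b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        model[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in ALPHABET}
    return model


def log_odds(seq, foreground, background, order=1):
    """Sum of log(P_fg / P_bg) over every transition in `seq`.

    Positive scores mean the sequence looks more like the foreground
    training set than the background under these two chains.
    """
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(foreground[ctx][base] / background[ctx][base])
    return score


# Toy data, invented for illustration only.
fg_model = train_markov_chain(["ACACACACAC"])
bg_model = train_markov_chain(["AAAACCCCGGGGTTTT"])

for candidate in ("ACACAC", "GGGG"):
    print(candidate, round(log_odds(candidate, fg_model, bg_model), 3))
```

In a genome-wide setting, one would presumably score each candidate region this way and rank the regions by score. How the foreground set is chosen and how scores are thresholded are left open here.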
Software is available via http://arep.med.harvard.edu/enhancer