SMASHing regulatory sites in DNA by human-mouse sequence comparisons.
Authors
Zavolan Mihaela, Socci Nicholas D, Rajewsky Nikolaus, Gaasterland Terry
Affiliation
Laboratory for Computational Genomics, The Rockefeller University, New York, NY 10021, USA.
Publication
Proc IEEE Comput Soc Bioinform Conf. 2003;2:277-86.
Regulatory sequence elements provide important clues to understanding and predicting gene expression. Although the binding sites for hundreds of transcription factors are known, there has been no systematic attempt to incorporate this information into the annotation of the human genome. Cross-species sequence comparisons are critical to a meaningful annotation of regulatory elements, since these elements generally reside in conserved non-coding regions. To take advantage of the recently completed drafts of the mouse and human genomes for annotating transcription factor binding sites, we developed SMASH, a computational pipeline that identifies thousands of orthologous human/mouse proteins, maps them to genomic sequences, extracts and compares upstream regions, and annotates putative regulatory elements in conserved, non-coding, upstream regions. Our current dataset consists of approximately 2,500 human/mouse gene pairs. Transcription start sites were estimated by mapping quasi-full-length cDNA sequences. SMASH uses a novel probabilistic method to identify putative conserved binding sites that takes into account the competition between transcription factors for binding DNA. SMASH presents the results via a genome browser web interface, which displays the predicted regulatory information together with the current annotations for the human genome. Our results are validated by comparison to previously published experimental data. SMASH results compare favorably to other existing computational approaches.
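The abstract does not specify the probabilistic method, so the following is only a minimal illustrative sketch of what "competition between transcription factors" for a conserved site could look like. It is not the SMASH algorithm. Everything in it is an assumption made for illustration: the toy weight matrices (`factorA`, `factorB`), the uniform background, equal priors for all competitors, ungapped aligned human and mouse windows of the same length as each matrix, and independent scoring of the two species.

```python
import math

# Assumption: uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def column(preferred, p=0.85):
    """Build one toy weight-matrix column that favours a single base."""
    rest = (1.0 - p) / 3
    return {base: (p if base == preferred else rest) for base in "ACGT"}


def log_odds(pwm, site):
    """Sum of log(p_motif / p_background) over the positions of `site`."""
    return sum(math.log(col[base] / BACKGROUND[base])
               for col, base in zip(pwm, site))


def competitive_posteriors(pwms, human_site, mouse_site):
    """Posterior that each factor, or background, explains an aligned window.

    Each factor's weight is exp(human log-odds + mouse log-odds); the
    background weight is 1. Normalising over all competitors makes the
    factors compete for the same window.
    """
    weights = {"background": 1.0}
    for name, pwm in pwms.items():
        score = log_odds(pwm, human_site) + log_odds(pwm, mouse_site)
        weights[name] = math.exp(score)
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical factors with similar but distinct preferred sites.
TOY_PWMS = {
    "factorA": [column(b) for b in "TATA"],
    "factorB": [column(b) for b in "TGTA"],
}

posteriors = competitive_posteriors(TOY_PWMS, "TATA", "TATA")
for name, prob in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {prob:.4f}")
```

In this toy setting, a window conserved in both species that matches `factorA` perfectly is assigned almost entirely to `factorA`, even though `factorB` also scores above background on its own. That is the qualitative effect of letting factors compete rather than scoring each one in isolation.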