Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health (NIH), Bethesda, Maryland 20894, USA.
Genome Res. 2010 Mar;20(3):381-92. doi: 10.1101/gr.098657.109. Epub 2010 Jan 14.
The various organogenic programs deployed during embryonic development rely on the precise expression of a multitude of genes in time and space. Identifying the cis-regulatory elements responsible for this tightly orchestrated regulation of gene expression is an essential step in understanding the genetic pathways involved in development. We describe a strategy to systematically identify tissue-specific cis-regulatory elements that share combinations of sequence motifs. Using heart development as an experimental framework, we employed a combination of Gibbs sampling and linear regression to build a classifier that identifies heart enhancers based on the presence and/or absence of various sequence features, including known and putative transcription factor (TF) binding specificities. In distinguishing heart enhancers from a large pool of random noncoding sequences, the performance of our classifier is vastly superior to four commonly used methods, with an accuracy reaching 92% in cross-validation. Furthermore, most of the binding specificities learned by our method resemble the specificities of TFs widely recognized as key players in heart development and differentiation, such as SRF, MEF2, ETS1, SMAD, and GATA. Using our classifier as a predictor, a genome-wide scan identified over 40,000 novel human heart enhancers. Although the classifier used no gene expression information, these novel enhancers are strongly associated with genes expressed in the heart. Finally, in vivo tests of our predictions in mouse and zebrafish achieved a validation rate of 62%, significantly higher than what is expected by chance. These results support the existence of underlying cis-regulatory codes dictating tissue-specific transcription in mammalian genomes and validate our enhancer classifier strategy as a method to uncover these regulatory codes.
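The classifier idea described above (sequence features combined by a linear model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the motif strings are toy consensus sequences standing in for the binding specificities the authors learn by Gibbs sampling, the four training sequences are invented, and the least-squares fit uses plain gradient descent because the abstract does not specify the regression procedure. It is not the authors' implementation.

```python
from typing import Dict, List

# Toy consensus strings, assumed for illustration only. The paper learns
# binding specificities by Gibbs sampling; these fixed strings stand in for them.
MOTIFS: Dict[str, str] = {
    "GATA_like": "AGATAA",
    "MEF2_like": "CTAAAAATAG",
    "SRF_like": "CCATATATGG",
}

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def count_hits(seq: str, motif: str) -> int:
    """Count windows of seq that match the motif on either strand."""
    rc = reverse_complement(motif)
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in (motif, rc))


def featurize(seq: str) -> List[float]:
    """Bias term followed by one hit count per motif."""
    return [1.0] + [float(count_hits(seq, m)) for m in MOTIFS.values()]


def fit_linear(X: List[List[float]], y: List[float],
               lr: float = 0.05, epochs: int = 2000) -> List[float]:
    """Least-squares linear regression fitted by batch gradient descent."""
    n = len(X)
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += 2.0 * err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def score(weights: List[float], seq: str) -> float:
    """Linear score; a positive value is read as 'enhancer-like'."""
    return sum(w * x for w, x in zip(weights, featurize(seq)))


# Invented toy sequences (hypothetical, not data from the paper).
enhancers = ["TTGCAGATAACCGTCTAAAAATAGGCA", "CCTTATCTGGCCATATATGGAA"]
background = ["GCGCGCGCGCGCGCGCGCGC", "ACGTACGTACGTACGTACGT"]

X = [featurize(s) for s in enhancers + background]
y = [1.0] * len(enhancers) + [-1.0] * len(background)
weights = fit_linear(X, y)

for s in enhancers + background:
    print(round(score(weights, s), 2), s)
```

In this sketch a sequence is called a candidate enhancer when its score is positive. A realistic version would replace the exact string matches with position weight matrix scores and evaluate the model by cross-validation on held-out sequences, as the abstract reports.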