Gladstone Institute of Cardiovascular Disease, San Francisco, California, United States of America; Institute for Human Genetics, University of California San Francisco, San Francisco, California, United States of America.
Institute for Human Genetics, University of California San Francisco, San Francisco, California, United States of America; Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America.
PLoS Comput Biol. 2014 Jun 26;10(6):e1003677. doi: 10.1371/journal.pcbi.1003677. eCollection 2014 Jun.
Gene-regulatory enhancers have been identified using various approaches, including evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we developed EnhancerFinder, a two-step method for distinguishing developmental enhancers from the genomic background and then predicting their tissue specificity. EnhancerFinder uses a multiple kernel learning approach to integrate DNA sequence motifs, evolutionary patterns, and diverse functional genomics datasets from a variety of cell types. In contrast with prediction approaches that define enhancers based on histone marks or p300 sites from a single cell line, we trained EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser. We comprehensively evaluated EnhancerFinder using cross validation and found that our integrative method improves the identification of enhancers over approaches that consider a single type of data, such as sequence motifs, evolutionary conservation, or the binding of enhancer-associated proteins. We find that VISTA enhancers active in embryonic heart are easier to identify than enhancers active in several other embryonic tissues, likely due to their uniquely high GC content. We applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are significantly enriched near genes with annotated roles in their predicted tissues and lead SNPs from genome-wide association studies. We demonstrate the utility of EnhancerFinder predictions through in vivo validation of novel embryonic gene regulatory enhancers from three developmental transcription factor loci. Our genome-wide developmental enhancer predictions are freely available as a UCSC Genome Browser track, which we hope will enable researchers to further investigate questions in developmental biology.
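As an illustration of the kernel integration idea mentioned in the abstract, the following is a minimal sketch of how similarity matrices built from different data types might be combined into a single kernel. It is not the EnhancerFinder implementation: the function names, the toy feature values, and the weights are all hypothetical, and the weights are fixed by hand here, whereas a multiple kernel learning method would learn them jointly with the classifier. Training the classifier itself is omitted.

```python
from math import sqrt

def linear_kernel(X):
    """Gram matrix of dot products between the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def normalize_kernel(K):
    """Rescale K so that every diagonal entry is 1 (cosine normalization)."""
    n = len(K)
    d = [sqrt(K[i][i]) for i in range(n)]
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices, one matrix per data type."""
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical feature views for four genomic regions (toy values only).
motif_counts = [[2, 0, 1], [1, 0, 2], [0, 3, 0], [0, 2, 1]]
conservation = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]]
functional = [[5, 1], [4, 0], [0, 3], [1, 4]]

# One normalized kernel per data type, so no view dominates by scale alone.
views = [motif_counts, conservation, functional]
kernels = [normalize_kernel(linear_kernel(v)) for v in views]

# Assumed, hand-picked weights; MKL would estimate these from training data.
weights = [0.5, 0.2, 0.3]
combined = combine_kernels(kernels, weights)
```

The combined matrix could then be passed to any classifier that accepts a precomputed kernel, such as a support vector machine. Normalizing each kernel before summing keeps the weights interpretable as the relative contribution of each data type.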