Division of Bioinformatics, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90089, USA.
Department of Surgery, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90089, USA.
NPJ Syst Biol Appl. 2023 Apr 3;9(1):9. doi: 10.1038/s41540-023-00270-z.
The vast majority of disease-associated variants identified in genome-wide association studies map to enhancers, powerful regulatory elements that orchestrate the recruitment of transcriptional complexes to their target genes' promoters to upregulate transcription in a cell type- and timing-dependent manner. These variants have implicated thousands of enhancers in many common genetic diseases, including nearly all cancers. However, the etiology of most of these diseases remains unknown because the regulatory target genes of the vast majority of enhancers are unknown. Thus, identifying the target genes of as many enhancers as possible is crucial for learning how enhancer regulatory activities function and contribute to disease. Using experimental results curated from scientific publications together with machine learning methods, we developed a cell type-specific score that predicts whether an enhancer targets a given gene. We computed the score genome-wide for every possible cis enhancer-gene pair and validated its predictive ability in four widely used cell lines. Using a pooled final model trained across multiple cell types, we scored all possible gene-enhancer regulatory links in cis (~17 million) and added them to the publicly available PEREGRINE database (www.peregrineproj.org). These scores provide a quantitative framework for enhancer-gene regulatory prediction that can be incorporated into downstream statistical analyses.