Suryamohan Kushal, Halfon Marc S
Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY, USA; NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY, USA.
Wiley Interdiscip Rev Dev Biol. 2015 Mar-Apr;4(2):59-84. doi: 10.1002/wdev.168. Epub 2014 Dec 29.
Gene expression is regulated through the activity of transcription factors (TFs) and chromatin-modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods have led to an explosion of both computational and empirical methods for CRM discovery in model and nonmodel organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against TFs or histone post-translational modifications, identification of nucleosome-depleted 'open' chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted TF-binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed. For further resources related to this article, please visit the WIREs website.
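One of the computational approaches named above, clustering of known or predicted TF-binding sites, can be illustrated with a minimal sketch. This is not the authors' method or any published tool. The function name `find_site_clusters`, the `window` and `min_sites` parameters, and the example coordinates are all hypothetical choices made for illustration. The idea is only that regions where several binding sites fall within a short stretch of sequence are flagged as candidate CRMs, which would still need experimental confirmation.

```python
from typing import Iterable, List, Tuple


def find_site_clusters(
    positions: Iterable[int],
    window: int = 500,   # assumed maximum span (bp) of a candidate cluster
    min_sites: int = 3,  # assumed minimum number of sites per cluster
) -> List[Tuple[int, int]]:
    """Return (start, end) spans where at least `min_sites` binding-site
    positions fall within `window` bp; overlapping spans are merged."""
    sites = sorted(positions)
    clusters: List[Tuple[int, int]] = []
    left = 0
    for right, pos in enumerate(sites):
        # Shrink the window from the left until it spans at most `window` bp.
        while pos - sites[left] > window:
            left += 1
        if right - left + 1 >= min_sites:
            start, end = sites[left], pos
            if clusters and start <= clusters[-1][1]:
                # Overlaps the previous candidate: extend it instead of adding a new one.
                clusters[-1] = (clusters[-1][0], end)
            else:
                clusters.append((start, end))
    return clusters


# Hypothetical predicted binding-site coordinates on one sequence.
example_sites = [100, 180, 250, 5000, 9000, 9100, 9150, 9400]
candidates = find_site_clusters(example_sites, window=500, min_sites=3)
print(candidates)
```

The isolated site at position 5000 is not reported, because a single site does not meet the density threshold. Loosening `window` or lowering `min_sites` would report more candidate regions, at the cost of more of the false-positive identifications the abstract warns about.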
The authors have declared no conflicts of interest for this article.