Agarwal Vikram, Inoue Fumitaka, Schubach Max, Penzar Dmitry, Martin Beth K, Dash Pyaree Mohan, Keukeleire Pia, Zhang Zicong, Sohota Ajuni, Zhao Jingjing, Georgakopoulos-Soares Ilias, Noble William S, Yardımcı Galip Gürkan, Kulakovskiy Ivan V, Kircher Martin, Shendure Jay, Ahituv Nadav
Department of Genome Sciences, University of Washington, Seattle, WA, USA.
mRNA Center of Excellence, Sanofi, Waltham, MA, USA.
Nature. 2025 Mar;639(8054):411-420. doi: 10.1038/s41586-024-08430-9. Epub 2025 Jan 15.
The human genome contains millions of candidate cis-regulatory elements (cCREs) with cell-type-specific activities that shape both health and many disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type specificity of these cCREs. Here we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of more than 680,000 sequences, representing an extensive set of annotated cCREs among three cell types (HepG2, K562 and WTC11), and found that 41.7% of these sequences were active. By testing sequences in both orientations, we found that promoters have strand-orientation biases and that their 200-nucleotide cores function as non-cell-type-specific 'on switches' that provide expression levels similar to those of their associated gene. By contrast, enhancers have weaker orientation biases but stronger tissue-specific characteristics. Using our lentiMPRA data, we developed sequence-based models that predict cCRE function and variant effects with high accuracy, delineate regulatory motifs and model their combinatorial effects. Testing a lentiMPRA library encompassing 60,000 cCREs in all three cell types further identified factors that determine cell-type specificity. Collectively, our work provides an extensive catalogue of functional CREs in three widely used cell lines and showcases how large-scale functional measurements can be used to dissect regulatory grammar.