Fu Xi, Mo Shentong, Buendia Alejandro, Laurent Anouchka P, Shao Anqi, Alvarez-Torres Maria Del Mar, Yu Tianji, Tan Jimin, Su Jiayu, Sagatelian Romella, Ferrando Adolfo A, Ciccia Alberto, Lan Yanyan, Owens David M, Palomero Teresa, Xing Eric P, Rabadan Raul
Program of Mathematical Genomics, Department of Systems Biology, Columbia University, New York, NY, USA.
Department of Biomedical Informatics, Columbia University, New York, NY, USA.
Nature. 2025 Jan;637(8047):965-973. doi: 10.1038/s41586-024-08391-z. Epub 2025 Jan 8.
Transcriptional regulation, which involves a complex interplay between regulatory sequences and proteins, directs all biological processes. Computational models of transcription lack generalizability to accurately extrapolate to unseen cell types and conditions. Here we introduce GET (general expression transformer), an interpretable foundation model designed to uncover regulatory grammars across 213 human fetal and adult cell types. Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types. GET also shows remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions, and uncovers universal and cell-type-specific transcription factor interaction networks. We evaluated its performance in prediction of regulatory activity, inference of regulatory elements and regulators, and identification of physical interactions between transcription factors and found that it outperforms current models in predicting lentivirus-based massively parallel reporter assay readout. In fetal erythroblasts, we identified distal (greater than 1 Mbp) regulatory regions that were missed by previous models, and, in B cells, we identified a lymphocyte-specific transcription factor-transcription factor interaction that explains the functional significance of a leukaemia risk predisposing germline mutation. In sum, we provide a generalizable and accurate model for transcription together with catalogues of gene regulation and transcription factor interactions, all with cell type specificity.