Ying Kejun, Song Jinyeop, Cui Haotian, Zhang Yikun, Li Siyuan, Chen Xingyu, Liu Hanna, Eames Alec, McCartney Daniel L, Marioni Riccardo E, Poganik Jesse R, Moqri Mahdi, Wang Bo, Gladyshev Vadim N
Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.
T. H. Chan School of Public Health, Harvard University, Boston, MA, USA.
bioRxiv. 2024 Nov 4:2024.10.30.621013. doi: 10.1101/2024.10.30.621013.
DNA methylation serves as a powerful biomarker for disease diagnosis and biological age assessment. However, current analytical approaches often rely on linear models that cannot capture the complex, context-dependent nature of methylation regulation. Here we present MethylGPT, a transformer-based foundation model trained on 226,555 human methylation profiles (154,063 after QC and deduplication) spanning diverse tissue types from 5,281 datasets, with 49,156 curated CpG sites and 7.6 billion training tokens. MethylGPT learns biologically meaningful representations of CpG sites, capturing both local genomic context and higher-order chromosomal features without external supervision. The model demonstrates robust methylation value prediction (Pearson R = 0.929) and maintains stable performance in downstream tasks with up to 70% missing data. Applied to age prediction across multiple tissue types, MethylGPT achieves superior accuracy compared to existing methods. Analysis of the model's attention patterns reveals distinct methylation signatures between young and old samples, with differential enrichment of developmental and aging-associated pathways. When fine-tuned for mortality and disease prediction across 60 major conditions using 18,859 samples from Generation Scotland, MethylGPT achieves robust predictive performance and enables systematic evaluation of intervention effects on disease risks, demonstrating potential for clinical applications. Our results demonstrate that transformer architectures can effectively model DNA methylation patterns while preserving biological interpretability, suggesting broad utility for epigenetic analysis and clinical applications.