EpiGePT: a pretrained transformer-based language model for context-specific human epigenomics.

Author information

Gao Zijing, Liu Qiao, Zeng Wanwen, Jiang Rui, Wong Wing Hung

Affiliations

Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China.

Department of Statistics, Stanford University, Stanford, CA 94305, USA.

Publication information

Genome Biol. 2024 Dec 18;25(1):310. doi: 10.1186/s13059-024-03449-7.

Abstract

The inherent similarities between natural language and biological sequences have inspired the use of large language models in genomics, but current models struggle to incorporate chromatin interactions or predict in unseen cellular contexts. To address this, we propose EpiGePT, a transformer-based model designed for predicting context-specific human epigenomic signals. By incorporating transcription factor activities and 3D genome interactions, EpiGePT outperforms existing methods in epigenomic signal prediction tasks, especially in cell-type-specific long-range interaction predictions and genetic variant impacts, advancing our understanding of gene regulation. A free online prediction service is available at http://health.tsinghua.edu.cn/epigept .
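To make the modeling idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration of the scheme the abstract describes: a DNA segment is embedded into per-bin tokens, a cell-type-specific transcription-factor (TF) activity vector conditions those tokens, and a transformer encoder predicts several epigenomic signal tracks per bin, with self-attention standing in for long-range (3D genome) dependencies. All class names, dimensions, TF counts, and layer choices are assumptions for illustration only; they are not the authors' actual EpiGePT implementation or its online API.

```python
# Minimal, hypothetical sketch of the idea in the abstract: DNA sequence bins,
# conditioned on a cell-type-specific TF activity vector, are fed to a
# transformer encoder that predicts per-bin epigenomic signal tracks.
# All names, sizes, and layers below are illustrative assumptions, not the
# authors' actual EpiGePT architecture.
import torch
import torch.nn as nn


class ContextConditionedEpigenomeModel(nn.Module):
    def __init__(self, n_tfs=512, n_bins=1000, d_model=256, n_tracks=8):
        super().__init__()
        # Convolutional embedding of one-hot DNA (4 channels) into per-bin tokens.
        self.seq_embed = nn.Sequential(
            nn.Conv1d(4, d_model, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(n_bins),  # pool base pairs down to n_bins tokens
        )
        # Cell-type context: project TF activities into the token space so the
        # same DNA sequence yields different predictions in different cell types.
        self.tf_embed = nn.Linear(n_tfs, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        # Self-attention lets distal bins exchange information, standing in for
        # long-range (3D genome) dependencies.
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(d_model, n_tracks)  # per-bin epigenomic signal tracks

    def forward(self, one_hot_seq, tf_activity):
        # one_hot_seq: (batch, 4, seq_len); tf_activity: (batch, n_tfs)
        tokens = self.seq_embed(one_hot_seq).transpose(1, 2)  # (batch, n_bins, d_model)
        context = self.tf_embed(tf_activity).unsqueeze(1)     # (batch, 1, d_model)
        hidden = self.transformer(tokens + context)           # condition every bin
        return self.head(hidden)                              # (batch, n_bins, n_tracks)


if __name__ == "__main__":
    model = ContextConditionedEpigenomeModel()
    seq = torch.randn(2, 4, 16_000)  # stand-in for one-hot encoded DNA segments
    tfs = torch.randn(2, 512)        # stand-in for TF activity scores (placeholder count)
    print(model(seq, tfs).shape)     # torch.Size([2, 1000, 8])
```

In this reading, context specificity comes entirely from the TF-activity vector: swapping in another cell type's TF activities changes the predicted tracks for the same DNA sequence, and a variant's impact could be scored by differencing predictions for the reference and alternative alleles.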


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82db/11657395/441076a392ef/13059_2024_3449_Fig1_HTML.jpg
