

Predicting differentially methylated cytosines in TET and DNMT3 knockout mutants via a large language model.

Authors

Sereshki Saleh, Lonardi Stefano

Affiliation

Department of Computer Science and Engineering, University of California, Riverside, 900 University Ave, Riverside, CA 92521, United States.

Publication

Brief Bioinform. 2025 Mar 4;26(2). doi: 10.1093/bib/bbaf092.

Abstract

DNA methylation is an epigenetic marker that directly or indirectly regulates several critical cellular processes. While cytosines in mammalian genomes generally maintain stable methylation patterns over time, other cytosines that belong to specific regulatory regions, such as promoters and enhancers, can exhibit dynamic changes. These changes in methylation are driven by a complex cellular machinery, in which the enzymes DNMT3 and TET play key roles. The objective of this study is to design a machine learning model capable of accurately predicting which cytosines have a fluctuating methylation level [hereafter called differentially methylated cytosines (DMCs)] from the surrounding DNA sequence. Here, we introduce L-MAP, a transformer-based large language model that is trained on DNMT3-knockout and TET-knockout data in human and mouse embryonic stem cells. Our extensive experimental results demonstrate the high accuracy of L-MAP in predicting DMCs. Our experiments also explore whether a classifier trained on human knockout data could predict DMCs in the mouse genome (and vice versa), and whether a classifier trained on DNMT3 knockout data could predict DMCs in TET knockouts (and vice versa). L-MAP enables the identification of sequence motifs associated with the enzymatic activity of DNMT3 and TET, which include known motifs but also novel binding sites that could provide new insights into DNA methylation in stem cells. L-MAP is available at https://github.com/ucrbioinfo/dmc_prediction.
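To make "predicting DMCs from the surrounding DNA sequence" concrete, the sketch below shows one common way to prepare such input for a sequence model: cut a fixed-length window centered on each cytosine, then split the window into overlapping k-mers. This is an illustration only. The function names, the flank size, and the choice of k are assumptions made here; the abstract does not state how L-MAP builds or tokenizes its input, and only the forward strand is scanned.

```python
from typing import Iterator, List, Tuple


def cytosine_windows(seq: str, flank: int) -> Iterator[Tuple[int, str]]:
    """Yield (position, window) for every cytosine that has `flank`
    bases available on both sides. Forward strand only.

    `flank` is a hypothetical parameter, not a value taken from L-MAP.
    """
    seq = seq.upper()
    for i in range(flank, len(seq) - flank):
        if seq[i] == "C":
            # Window of length 2 * flank + 1 with the cytosine in the middle.
            yield i, seq[i - flank : i + flank + 1]


def kmer_tokens(window: str, k: int) -> List[str]:
    """Split a window into overlapping k-mers (stride 1).

    Overlapping k-mers are one tokenization used by some DNA language
    models; it is assumed here purely for illustration.
    """
    return [window[i : i + k] for i in range(len(window) - k + 1)]


if __name__ == "__main__":
    toy_sequence = "AACGTTCAG"  # made-up sequence, not real data
    for position, window in cytosine_windows(toy_sequence, flank=2):
        print(position, window, kmer_tokens(window, k=3))
```

In a full pipeline, each tokenized window would be paired with a binary label (DMC or not, as determined from the knockout methylation data) and passed to a classifier; that step is omitted because the abstract gives no detail on it.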


