MuLan-Methyl: multiple transformer-based language models for accurate DNA methylation prediction.

Affiliations

Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany.

International Max Planck Research School "From Molecules to Organisms", Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany.

Publication information

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad054. Epub 2023 Jul 25.

DOI:10.1093/gigascience/giad054

PMID:37489753

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10367125/

Abstract

Transformer-based language models are successfully used to address a wide range of text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
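
The abstract states that the 5 fine-tuned models predict the methylation status collectively but does not say how their outputs are combined. As a minimal sketch, assuming simple soft voting (averaging the per-model probabilities and applying a 0.5 threshold), the idea could look like the following. The function names, the threshold, and the example probabilities are all hypothetical and are not taken from the MuLan-Methyl code.

```python
from statistics import mean


def ensemble_probability(model_probs):
    """Average the per-model probabilities that a site is methylated.

    Assumption: plain soft voting; the abstract does not specify the rule.
    """
    if not model_probs:
        raise ValueError("need at least one model probability")
    for p in model_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return mean(model_probs)


def ensemble_predict(model_probs, threshold=0.5):
    """Return True if the averaged probability reaches the threshold."""
    return ensemble_probability(model_probs) >= threshold


# Hypothetical outputs of five fine-tuned models for one DNA fragment.
example_probs = [0.91, 0.78, 0.66, 0.42, 0.85]
print(ensemble_probability(example_probs), ensemble_predict(example_probs))
```

In this sketch a single dissenting model (0.42) does not flip the call, which illustrates why combining several models can be more robust than relying on any one of them.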

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2db8/10367125/e2842f57994c/giad054fig1.jpg
