Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany.
International Max Planck Research School "From Molecules to Organisms", Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad054. Epub 2023 Jul 25.
Transformer-based language models are successfully used to address a wide range of text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning then trains each model to predict the DNA methylation status for each methylation type. The 5 fine-tuned models jointly predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
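The abstract describes two mechanisms that a short sketch can make concrete: turning a DNA fragment plus its taxonomy lineage into a text sample for a language model, and combining the outputs of 5 models into one prediction. The Python below is only an illustration under stated assumptions, not the MuLan-Methyl implementation. The overlapping k-mer tokenization, the `[SEP]` separator, the averaging rule, the 0.5 threshold, and all function names are hypothetical choices made for this example; the abstract does not specify any of them.

```python
from statistics import mean
from typing import Sequence


def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Split a DNA fragment into overlapping k-mers.

    Assumption: k-mer "words" are one common way to present DNA to a
    language model; the abstract does not state the tokenization used.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_sample(seq: str, lineage: Sequence[str], k: int = 6) -> str:
    """Join k-mers and taxonomy lineage into a single text sample.

    Assumption: the "[SEP]" marker is a hypothetical separator chosen
    for this sketch only.
    """
    return " ".join(kmer_tokens(seq, k)) + " [SEP] " + " ".join(lineage)


def ensemble_probability(probs: Sequence[float]) -> float:
    """Average the per-model probabilities that a site is methylated.

    Assumption: plain averaging (soft voting) stands in for whatever
    joint prediction rule the framework actually uses.
    """
    if not probs:
        raise ValueError("need at least one model probability")
    return mean(probs)


def predict_methylated(probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Call a site methylated when the averaged probability reaches the threshold."""
    return ensemble_probability(probs) >= threshold


if __name__ == "__main__":
    # Hypothetical fragment and lineage, for illustration only.
    sample = build_sample("ACGTACGTAC", ["Eukaryota", "Viridiplantae"], k=6)
    print(sample)

    # Hypothetical outputs of 5 fine-tuned models for one candidate site.
    model_probs = [0.9, 0.8, 0.7, 0.6, 0.5]
    print(predict_methylated(model_probs))
```

Averaging probabilities is one simple way to let several models vote; a majority vote over per-model labels or a learned weighting would be equally plausible readings of "collectively predict".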