Identification of DNA N6-methyladenine modifications in the rice genome with a fine-tuned large language model.
Authors
Zhang Yichi, Chen Hao, Xiang Shicheng, Lv Zhibin
Affiliation
College of Biomedical Engineering, Sichuan University, Chengdu, China.
Publication information
Front Plant Sci. 2025 Jun 25;16:1626539. doi: 10.3389/fpls.2025.1626539. eCollection 2025.
DNA N6-methyladenine (6mA) plays a significant role in various biological processes. In the rice genome, 6mA influences gene expression and is involved in important processes such as growth and development. Identifying 6mA sites in rice is therefore crucial for understanding its complex gene expression regulatory system. Although several useful prediction models have been proposed, there is still room for improvement. To address this, we propose an architecture named iRice6mA-LMXGB that integrates a fine-tuned large language model to identify 6mA sites in rice. Specifically, our method consists of two main components: (1) a BERT model for feature extraction and (2) an XGBoost module for 6mA classification. We use a pre-trained DNABERT-2 model to initialize the parameters of the BERT component. Through transfer learning, we fine-tune the model on the rice 6mA recognition task so that it converts raw DNA sequences into high-dimensional feature vectors. These features are then passed to an XGBoost classifier to generate predictions. To further validate the effectiveness of our fine-tuning strategy, we employ UMAP (Uniform Manifold Approximation and Projection) visualization. Our approach achieves a validation accuracy of 0.9903 in five-fold cross-validation, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.9994. Compared with existing predictors trained on the same dataset, our method demonstrates superior performance. This study provides a powerful tool for advancing research in rice 6mA epigenetics.
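The two-stage design and the evaluation protocol described in the abstract (sequence to feature vector, feature vector to classifier, five-fold cross-validation scored by accuracy and AUC) can be illustrated with a small standard-library sketch. It is not the authors' implementation. Normalized 3-mer frequencies stand in for the fine-tuned DNABERT-2 embeddings, a nearest-centroid scorer stands in for XGBoost, and the sequences are synthetic toy data rather than rice 6mA data. All function names, the sequence length, and the base composition of the toy positives are assumptions made for illustration.

```python
import random
from itertools import product

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 64 possible 3-mers

def kmer_features(seq):
    """Stand-in for the BERT feature extractor: normalized 3-mer frequencies."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:  # skip windows containing non-ACGT characters
            counts[kmer] += 1
    total = max(1, sum(counts.values()))
    return [counts[m] / total for m in KMERS]

def fit_centroids(X, y):
    """Stand-in for the XGBoost classifier: mean feature vector per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def score(centroids, x):
    """Positive when x is closer to the class-1 centroid than to class 0."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, centroids[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, centroids[1]))
    return d0 - d1

def roc_auc(y, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_validate(seqs, y, n_folds=5, seed=0):
    """Stratified k-fold CV; returns mean accuracy and mean AUC over the folds."""
    X = [kmer_features(s) for s in seqs]
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in (0, 1):  # deal each class round-robin so every fold has both
        members = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % n_folds].append(i)
    accs, aucs = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        s = [score(centroids, X[i]) for i in held_out]
        t = [y[i] for i in held_out]
        accs.append(sum((v > 0) == (lab == 1) for v, lab in zip(s, t)) / len(t))
        aucs.append(roc_auc(t, s))
    return sum(accs) / n_folds, sum(aucs) / n_folds

def toy_sequence(rng, positive, length=41):
    """Synthetic sequence; toy positives are A/G-rich purely for illustration."""
    weights = [4, 1, 4, 1] if positive else [1, 1, 1, 1]  # order: A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

rng = random.Random(42)
labels = [1] * 100 + [0] * 100
seqs = [toy_sequence(rng, bool(t)) for t in labels]
mean_acc, mean_auc = cross_validate(seqs, labels)
print(f"mean accuracy = {mean_acc:.3f}, mean AUC = {mean_auc:.3f}")
```

In a closer reproduction, `kmer_features` would be replaced by embeddings from the fine-tuned language model and `fit_centroids`/`score` by a gradient-boosted classifier, with any fine-tuning repeated inside each training fold so that held-out sequences never influence the features.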