Identification of DNA N6-methyladenine modifications in the rice genome with a fine-tuned large language model.

Author information

Zhang Yichi, Chen Hao, Xiang Shicheng, Lv Zhibin

Affiliation

College of Biomedical Engineering, Sichuan University, Chengdu, China.

Publication information

Front Plant Sci. 2025 Jun 25;16:1626539. doi: 10.3389/fpls.2025.1626539. eCollection 2025.

DOI: 10.3389/fpls.2025.1626539

PMID: 40636005

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12239757/

Abstract

DNA N6-methyladenine (6mA) plays a significant role in various biological processes. In the rice genome, 6mA is involved in growth and development and influences gene expression, so identifying 6mA sites in rice is crucial for understanding its complex gene expression regulatory system. Although several useful prediction models have been proposed, there is still room for improvement. To address this, we propose iRice6mA-LMXGB, an architecture that integrates a fine-tuned large language model to identify 6mA sites in rice. The method consists of two main components: (1) a BERT model for feature extraction and (2) an XGBoost module for 6mA classification. The BERT component is initialized with the parameters of a pre-trained DNABERT-2 model and then fine-tuned, through transfer learning, on the rice 6mA recognition task, so that it converts raw DNA sequences into high-dimensional feature vectors. These features are then passed to an XGBoost classifier to generate predictions. To further validate the effectiveness of the fine-tuning strategy, we employ Uniform Manifold Approximation and Projection (UMAP) visualization. Our approach achieves a validation accuracy of 0.9903 under five-fold cross-validation and an area under the receiver operating characteristic (ROC) curve (AUC) of 0.9994. Compared with existing predictors trained on the same dataset, our method demonstrates superior performance. This study provides a powerful tool for advancing research in rice 6mA epigenetics.
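The two-stage design and its five-fold evaluation can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The feature extractor here is a simple 2-mer frequency vector standing in for fine-tuned DNABERT-2 embeddings, the classifier is a nearest-centroid rule standing in for XGBoost, and the sequences are synthetic. All function names (`embed`, `fit_centroids`, `cross_validate`, and so on) are hypothetical. Only the overall shape follows the abstract: sequence to feature vector, feature vector to classifier score, then accuracy and ROC AUC averaged over five folds.

```python
import random
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]

def embed(seq):
    """Stand-in feature extractor: normalized 2-mer frequencies.
    (The paper uses embeddings from a fine-tuned DNABERT-2 model instead.)"""
    counts = {k: 0 for k in KMERS}
    for i in range(len(seq) - 1):
        kmer = seq[i:i + 2]
        if kmer in counts:
            counts[kmer] += 1
    total = max(sum(counts.values()), 1)
    return [counts[k] / total for k in KMERS]

def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per class.
    (The paper uses XGBoost on the extracted features.)"""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def score(cents, x):
    """Positive when x is closer to the class-1 centroid than to class 0."""
    dist = {c: sum((a - b) ** 2 for a, b in zip(x, cents[c])) for c in cents}
    return dist[0] - dist[1]

def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds, keeping the class balance in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels)):
        idx = [i for i, t in enumerate(labels) if t == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def cross_validate(seqs, labels, k=5):
    """Return mean accuracy and mean ROC AUC over k held-out folds."""
    X = [embed(s) for s in seqs]
    accs, aucs = [], []
    for held_out in stratified_folds(labels, k):
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        cents = fit_centroids([X[i] for i in train], [labels[i] for i in train])
        scores = [score(cents, X[i]) for i in held_out]
        truth = [labels[i] for i in held_out]
        preds = [1 if s > 0 else 0 for s in scores]
        accs.append(sum(p == t for p, t in zip(preds, truth)) / len(truth))
        aucs.append(roc_auc(scores, truth))
    return sum(accs) / k, sum(aucs) / k

def toy_data(n_per_class=50, length=40, seed=1):
    """Synthetic sequences only: class 1 is A/T-rich, class 0 is G/C-rich."""
    rng = random.Random(seed)
    seqs, labels = [], []
    for i in range(2 * n_per_class):
        label = i % 2
        weights = [4, 1, 1, 4] if label else [1, 4, 4, 1]  # order: A, C, G, T
        seqs.append("".join(rng.choices("ACGT", weights=weights, k=length)))
        labels.append(label)
    return seqs, labels

if __name__ == "__main__":
    seqs, labels = toy_data()
    acc, auc = cross_validate(seqs, labels)
    print(f"mean accuracy={acc:.4f}, mean AUC={auc:.4f}")
```

One design point to keep in mind when replacing the stand-ins with a learned feature extractor: any fine-tuning should use only the training portion of each fold, otherwise held-out labels leak into the features and the cross-validation scores become optimistic. The unsupervised `embed` used here does not have that issue.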

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ba3/12239757/cb08990d335d/fpls-16-1626539-g001.jpg
