Evaluating the representational power of pre-trained DNA language models for regulatory genomics.

Author information

Tang Ziqi, Somia Nirali, Yu Yiyang, Koo Peter K

Affiliations

Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA.

The Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, USA.

Publication information

bioRxiv. 2024 Sep 25:2024.02.29.582810. doi: 10.1101/2024.02.29.582810.

DOI:10.1101/2024.02.29.582810

PMID:38464101

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10925287/

Abstract

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
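To make the two ingredients named in the abstract concrete, the sketch below shows one-hot encoding of a DNA sequence and a simple linear probe fitted on fixed (frozen) features. It is an illustration only and not the authors' pipeline: the function names, the ridge-regression probe, and the toy GC-count label are all assumptions chosen so the example runs with NumPy alone. In an actual probing experiment, the feature matrix would instead hold embeddings extracted from a frozen pre-trained gLM, and the targets would be measured functional activity.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot_encode(seq: str) -> np.ndarray:
    """Return an (L, 4) one-hot matrix; non-ACGT characters (e.g. N) give all-zero rows."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float64)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def add_bias_column(features: np.ndarray) -> np.ndarray:
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_ridge_probe(features: np.ndarray, targets: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression on fixed features.

    The features are never updated, which is what distinguishes probing
    from fine-tuning. For simplicity the intercept is regularized too.
    """
    X = add_bias_column(features)
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ targets)


def predict(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Apply a fitted probe to new feature rows."""
    return add_bias_column(features) @ weights


# Toy data (an assumption for illustration): random 20-mers labeled by GC count.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(ALPHABET), size=20)) for _ in range(300)]
gc_count = np.array([s.count("G") + s.count("C") for s in seqs], dtype=np.float64)

# Flatten each (L, 4) matrix into one feature vector per sequence.
features = np.stack([one_hot_encode(s).ravel() for s in seqs])

# Fit on the first 200 sequences, evaluate on the remaining 100.
weights = fit_ridge_probe(features[:200], gc_count[:200], alpha=1e-3)
held_out_pred = predict(features[200:], weights)
corr = np.corrcoef(held_out_pred, gc_count[200:])[0, 1]
print(f"held-out Pearson r: {corr:.3f}")
```

The same `fit_ridge_probe` and `predict` calls could be applied to any other fixed feature matrix, which is what allows a like-for-like comparison between one-hot inputs and frozen gLM embeddings on a shared train/test split.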

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b335/11455533/7fb1a1b37964/nihpp-2024.02.29.582810v2-f0001.jpg

Similar articles

Evaluating the representational power of pre-trained DNA language models for regulatory genomics.

Genome Biol. 2025 Jul 14;26(1):203. doi: 10.1186/s13059-025-03674-8.

Cited by

The role of chromatin state in intron retention: A case study in leveraging large scale deep learning models.

PLoS Comput Biol. 2025 Jan 10;21(1):e1012755. doi: 10.1371/journal.pcbi.1012755. eCollection 2025 Jan.
