Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Author information

Schiff Yair, Kao Chia-Hsiang, Gokaslan Aaron, Dao Tri, Gu Albert, Kuleshov Volodymyr

Affiliations

Department of Computer Science, Cornell University, New York, NY USA.

Department of Computer Science, Princeton University, Princeton, NJ USA.

Publication information

Proc Mach Learn Res. 2024 Jul;235:43632-43648.

PMID: 40567809

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12189541/

Abstract

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here.
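
The RC equivariance property named in the abstract can be illustrated with a small toy example. The sketch below is not the authors' MambaDNA implementation; it only shows, under simplifying assumptions, what it means for a layer to commute with the reverse-complement operation. The names `toy_causal_layer`, `rc_equivariant_layer`, and `rc_output` are hypothetical, and the "layer" is a stand-in (a running count of `A` bases) chosen because it is direction-dependent, like a left-to-right sequence model.

```python
# Toy illustration of reverse-complement (RC) equivariance.
# Assumption: all function names here are hypothetical and the "layer"
# is a stand-in, not the Mamba block described in the paper.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def toy_causal_layer(seq):
    """Direction-dependent toy layer: running count of 'A' seen so far."""
    out, count = [], 0
    for base in seq:
        count += base == "A"
        out.append(count)
    return out


def rc_equivariant_layer(seq):
    """Apply the same toy layer to the sequence and to its RC.

    The RC branch is reversed back so that both channels are aligned
    to the positions of the input sequence.
    """
    forward = toy_causal_layer(seq)
    reverse = toy_causal_layer(reverse_complement(seq))[::-1]
    return list(zip(forward, reverse))


def rc_output(features):
    """RC action on outputs: reverse positions and swap the two channels."""
    return [(b, a) for (a, b) in reversed(features)]


seq = "AACGT"
lhs = rc_equivariant_layer(reverse_complement(seq))
rhs = rc_output(rc_equivariant_layer(seq))
assert lhs == rhs  # layer(RC(x)) == RC(layer(x))
```

Because the same function is applied to both strands, taking the reverse complement of the input only reverses the positions and swaps the two output channels, which is the sense of equivariance used here.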
