
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling.

Authors

Schiff Yair, Kao Chia-Hsiang, Gokaslan Aaron, Dao Tri, Gu Albert, Kuleshov Volodymyr

Affiliations

Department of Computer Science, Cornell University, New York, NY USA.

Department of Computer Science, Princeton University, Princeton, NJ USA.

Publication

Proc Mach Learn Res. 2024 Jul;235:43632-43648.

Abstract

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds on the long-range Mamba block and extends it to a BiMamba component that supports bi-directionality and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here.
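To make the notion of RC equivariance concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a one-hot encoding with channel order A, C, G, T, so that the RC operation on an array amounts to flipping both the length axis and the channel axis. The names `causal_mix` and `make_rc_equivariant` are hypothetical and are used only for illustration: the first is a toy left-to-right layer, and the second wraps any layer `g` as `g(x) + RC(g(RC(x)))`, which is one generic way to obtain an RC equivariant map.

```python
import numpy as np

# Complement map for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed channel order


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) array in A, C, G, T order."""
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        out[i, BASE_INDEX[base]] = 1.0
    return out


def rc_array(x: np.ndarray) -> np.ndarray:
    """RC on arrays: reverse positions and reverse channels.

    With A, C, G, T channel order, reversing channels swaps A<->T and C<->G.
    """
    return x[::-1, ::-1]


def causal_mix(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Toy left-to-right layer (hypothetical): running sum, then channel mixing."""
    return np.cumsum(x, axis=0) @ w


def make_rc_equivariant(g):
    """Wrap g so that f(RC(x)) == RC(f(x)) holds for every input x."""
    def f(x: np.ndarray) -> np.ndarray:
        return g(x) + rc_array(g(rc_array(x)))
    return f


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))  # arbitrary weights for the toy layer
seq = "ATGCCGTTA"
x = one_hot(seq)

f = make_rc_equivariant(lambda a: causal_mix(a, w))

print(reverse_complement("ATGC"))  # GCAT
# The array-level RC agrees with the string-level RC.
print(np.allclose(rc_array(x), one_hot(reverse_complement(seq))))
# The wrapped layer commutes with RC; the unwrapped toy layer generally does not.
print(np.allclose(f(rc_array(x)), rc_array(f(x))))
```

Because the wrapped layer runs the left-to-right toy layer on both the sequence and its reverse complement, each output position also depends on positions on both sides of it, which loosely mirrors the combination of bi-directionality and equivariance described in the abstract.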




