Schiff Yair, Kao Chia-Hsiang, Gokaslan Aaron, Dao Tri, Gu Albert, Kuleshov Volodymyr
Department of Computer Science, Cornell University, New York, NY USA.
Department of Computer Science, Princeton University, Princeton, NJ USA.
Proc Mach Learn Res. 2024 Jul;235:43632-43648.
Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds on the long-range Mamba block, extending it to a BiMamba component that supports bi-directionality and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here.
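To make the two properties named in the abstract concrete, the following is a minimal Python sketch of what reverse complementarity and RC equivariance mean for a per-position function. It is an illustration only and not the Caduceus architecture: the names `reverse_complement`, `gc_window`, and `gc_prefix` are hypothetical, and the GC-counting functions are toy stand-ins for a model. For a scalar per-position output, RC equivariance is taken here to mean that feeding the reverse complement of a sequence yields the original outputs in reversed order.

```python
# Complement map for the four DNA bases; "N" (unknown base) maps to itself.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement (RC) of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_window(seq: str, radius: int = 1) -> list:
    """Toy bi-directional function (hypothetical, for illustration).

    For each position, count G/C bases in a symmetric window that
    looks both upstream and downstream of that position.
    """
    is_gc = [1 if base in "GC" else 0 for base in seq]
    return [
        sum(is_gc[max(0, i - radius): i + radius + 1])
        for i in range(len(seq))
    ]


def gc_prefix(seq: str) -> list:
    """Toy uni-directional function (hypothetical, for illustration).

    For each position, count G/C bases seen so far, reading left to
    right only, in the manner of a causal model.
    """
    counts, total = [], 0
    for base in seq:
        total += 1 if base in "GC" else 0
        counts.append(total)
    return counts


def is_rc_equivariant(fn, seq: str) -> bool:
    """Check fn(RC(seq)) == reverse(fn(seq)) for one sequence."""
    return fn(reverse_complement(seq)) == list(reversed(fn(seq)))


if __name__ == "__main__":
    seq = "AACGT"
    print(reverse_complement(seq))
    print(is_rc_equivariant(gc_window, seq))
    print(is_rc_equivariant(gc_prefix, seq))
```

In this sketch the symmetric-window function passes the check on the example sequence while the left-to-right function does not, which is one way to see why a model that reads in a single direction does not get this symmetry for free.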