Hingerl Johannes C, Karollus Alexander, Gagneur Julien
School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.
Munich Center for Machine Learning, Munich, Germany.
Bioinformatics. 2025 Sep 4. doi: 10.1093/bioinformatics/btaf467.
Accurately predicting how DNA sequence drives gene regulation and how genetic variants alter gene expression is a central challenge in genomics. Borzoi, which models over ten thousand genomic assays, including RNA-seq coverage, from over half a megabase of sequence context alone, promises to become an important foundation model in regulatory genomics, both for annotating variants at massive scale and for further model development. However, the relative positional encodings it currently uses limit Borzoi's computational efficiency.
We present Flashzoi, an enhanced Borzoi model that leverages rotary positional encodings and FlashAttention-2. These changes yield over 3-fold faster training and inference and up to 2.4-fold lower memory usage, while maintaining or improving accuracy in modeling various genomic assays, including RNA-seq coverage, in predicting variant effects, and in linking enhancers to promoters. Flashzoi's improved efficiency facilitates large-scale genomic analyses and opens avenues for exploring more complex regulatory mechanisms and modeling.
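As an illustration of the rotary positional encodings mentioned above, the following is a minimal pure-Python sketch of the general technique, not the code used in borzoi-pytorch. The function names `rotary_encode` and `attention_score` and the frequency base of 10000 are assumptions chosen for illustration. The sketch shows the property that makes rotary encodings attractive here: after rotating queries and keys by position-dependent angles, their dot product depends only on the relative offset between positions, so no separate relative-position term has to be added to the attention scores.

```python
import math


def rotary_encode(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    Illustrative helper (hypothetical name); `base` is an assumed default.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Each pair (i, i + 1) gets its own rotation frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # Standard 2D rotation of the pair (x, y) by angle theta.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_score(q, k, q_pos, k_pos):
    """Unnormalized attention logit between a rotated query and a rotated key."""
    q_rot = rotary_encode(q, q_pos)
    k_rot = rotary_encode(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.1, 0.4, -0.6, 0.9]
    # Both pairs of positions are 3 apart, so the two scores should agree
    # up to floating-point error.
    print(attention_score(q, k, 5, 2))
    print(attention_score(q, k, 105, 102))
```

Because the positional information is carried by the rotated queries and keys themselves, the attention computation reduces to a plain scaled dot product, which is the form that fused kernels such as FlashAttention-2 are designed to accelerate.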
The Flashzoi model architecture is part of the MIT-licensed borzoi-pytorch package, which is available at https://github.com/johahi/borzoi-pytorch and can be installed via pip. Model weights for all four Flashzoi and Borzoi replicates are available at https://huggingface.co/johahi under the MIT license. The code has been archived at https://zenodo.org/records/15669913.
Supplementary data are available at Bioinformatics online.