Linder Johannes, Srivastava Divyanshi, Yuan Han, Agarwal Vikram, Kelley David R
Calico Life Sciences LLC, South San Francisco, CA, USA.
mRNA Center of Excellence, Sanofi Pasteur Inc., Cambridge, MA, USA.
Nat Genet. 2025 Apr;57(4):949-961. doi: 10.1038/s41588-024-02053-6. Epub 2025 Jan 8.
Sequence-based machine-learning models trained on genomics data improve genetic variant interpretation by providing functional predictions describing their impact on the cis-regulatory code. However, current tools do not predict RNA-seq expression profiles because of modeling challenges. Here, we introduce Borzoi, a model that learns to predict cell-type-specific and tissue-specific RNA-seq coverage from DNA sequence. Using statistics derived from Borzoi's predicted coverage, we isolate and accurately score DNA variant effects across multiple layers of regulation, including transcription, splicing and polyadenylation. Evaluated on quantitative trait loci, Borzoi is competitive with and often outperforms state-of-the-art models trained on individual regulatory functions. By applying attribution methods to the derived statistics, we extract cis-regulatory motifs driving RNA expression and post-transcriptional regulation in normal tissues. The wide availability of RNA-seq data across species, conditions and assays profiling specific aspects of regulation emphasizes the potential of this approach to decipher the mapping from DNA sequence to regulatory function.
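The abstract describes scoring variant effects with statistics derived from predicted coverage. As a rough illustration only, one such statistic could be a log fold change in summed coverage between the reference and alternate alleles over a window of interest. The function name, the pseudocount, the window choice and the toy numbers below are all assumptions made for this sketch; they are not taken from the paper and are not Borzoi's API or its actual scoring statistics.

```python
import math


def log_fold_change(ref_cov, alt_cov, start, end, pseudocount=1.0):
    """Log2 ratio of summed coverage (alt vs. ref) over bins [start, end).

    ref_cov and alt_cov are per-bin predicted coverage values for the
    reference and alternate sequence. The pseudocount (an assumption of
    this sketch) keeps the ratio finite when coverage is near zero.
    """
    ref_sum = sum(ref_cov[start:end])
    alt_sum = sum(alt_cov[start:end])
    return math.log2((alt_sum + pseudocount) / (ref_sum + pseudocount))


# Toy coverage tracks: hypothetical numbers, not real model output.
ref_track = [0.0, 1.0, 4.0, 4.0, 1.0, 0.0]
alt_track = [0.0, 0.5, 2.0, 1.0, 0.5, 0.0]

# Score the variant over bins 1..4 (e.g. bins overlapping a gene's exons).
score = log_fold_change(ref_track, alt_track, start=1, end=5)
print(f"log2 fold change (alt vs. ref): {score:.3f}")
```

A negative score here means the alternate allele is predicted to lower coverage in the window. Restricting the window to different features (whole gene body, a single exon junction, the region around a polyadenylation site) is one plausible way the same idea could be pointed at transcription, splicing or polyadenylation, though the paper's own statistics may differ.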