Liu Tianyu, Zhang Xiangyu, Lin Jiecong, Pinello Luca, Ying Rex, Zhao Hongyu
bioRxiv. 2025 Aug 23:2025.02.26.640468. doi: 10.1101/2025.02.26.640468.
Sequence-to-function models predict gene expression from DNA sequence and can link genetic information with transcriptomic data to help explain regulatory processes and their effects on complex phenotypes. Genomic language models are pre-trained on large-scale DNA sequences and generate robust representations of those sequences by learning genomic context. However, few studies have estimated the predictability of gene expression levels and bridged these two classes of models to explore individualized gene expression prediction. In this manuscript, we propose UKBioBERT, a DNA language model pre-trained with genetic variants from the UK Biobank. We demonstrate that UKBioBERT generates informative embeddings that can identify gene functions and improve gene expression prediction in cell lines, thereby enhancing our understanding of gene expression predictability. Building upon these embeddings, we combine UKBioBERT with the state-of-the-art sequence-to-function architectures Enformer and Borzoi to create UKBioFormer and UKBioZoi. These models perform better at predicting expression levels for genes whose expression is highly predictable, and they generalize across different cohorts. Furthermore, UKBioFormer effectively captures the relationship between genetic variants and expression variation, enabling in-silico mutation analyses. Collectively, our findings underscore the value of integrating genomic language models with sequence-to-function approaches for advancing functional genomics research.