Long Weicai, Su Houcheng, Xiong Jiaqi, Zhang Yanlin
Data Science and Analytics Thrust, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, 511453, China.
Bioinformatics. 2025 Jul 1;41(Supplement_1):i294-i303. doi: 10.1093/bioinformatics/btaf229.
Understanding the genomic foundation of human diversity and disease requires models that effectively capture sequence variation, such as single nucleotide polymorphisms (SNPs). While recent genomic foundation models have scaled to larger datasets and multi-species inputs, they often fail to account for the sparsity and redundancy inherent in human population data, such as those in the 1000 Genomes Project. SNPs are rare in humans, and current masked language models (MLMs) trained directly on whole-genome sequences may struggle to efficiently learn these variations. Additionally, training on the entire dataset without prioritizing regions of genetic variation results in inefficiencies and negligible gains in performance.
We present MutBERT, a probabilistic genome-based masked language model that efficiently utilizes SNP information from population-scale genomic data. By representing the entire genome as a probabilistic distribution over observed allele frequencies, MutBERT focuses on informative genomic variations while maintaining computational efficiency. We evaluated MutBERT against DNABERT-2, various versions of Nucleotide Transformer, and modified versions of MutBERT across multiple downstream prediction tasks. MutBERT consistently ranked as one of the top-performing models, demonstrating that this novel representation strategy enables better utilization of biobank-scale genomic data in building pretrained genomic foundation models.
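The core idea of the representation can be sketched in a few lines: each genomic position becomes a probability vector over the four bases, one-hot at invariant sites and spread over the observed allele frequencies at SNP sites. This is a minimal illustration only; the function name, the dict-based frequency interface, and the example frequencies are hypothetical and not taken from the paper's actual code.

```python
import numpy as np

BASES = "ACGT"
BASE_IDX = {b: i for i, b in enumerate(BASES)}

def probabilistic_encoding(ref_seq, snp_freqs):
    """Encode a reference sequence as per-position probability vectors
    over {A, C, G, T}.

    Invariant positions are one-hot on the reference base; SNP positions
    carry the observed population allele frequencies instead.

    ref_seq: reference sequence string, e.g. "ACGT"
    snp_freqs: dict mapping position -> {allele: frequency}
               (frequencies at each SNP are assumed to sum to 1)
    """
    probs = np.zeros((len(ref_seq), len(BASES)))
    for pos, base in enumerate(ref_seq):
        if pos in snp_freqs:
            # SNP site: distribute probability mass over observed alleles
            for allele, freq in snp_freqs[pos].items():
                probs[pos, BASE_IDX[allele]] = freq
        else:
            # Invariant site: one-hot on the reference base
            probs[pos, BASE_IDX[base]] = 1.0
    return probs

# Hypothetical example: a SNP at position 2 with alleles G (95%) and A (5%)
enc = probabilistic_encoding("ACGT", {2: {"G": 0.95, "A": 0.05}})
```

Under this encoding, a masked language model sees soft targets at variant sites rather than a single hard base, which is how population-scale variation can enter pretraining without enumerating individual genomes.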