Ziqi Tang, Nirali Somia, Yiyang Yu, Peter K. Koo
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA.
The Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, USA.
bioRxiv. 2024 Sep 25:2024.02.29.582810. doi: 10.1101/2024.02.29.582810.
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
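The probing protocol the abstract refers to can be illustrated with a minimal sketch. The code below is not the authors' pipeline: it uses synthetic random arrays as hypothetical stand-ins for frozen gLM embeddings, one-hot encoded sequences, and functional-activity labels, and fits the same lightweight ridge probe on both representations, which is the general recipe for asking whether pre-trained features beat raw one-hot inputs.

```python
# Minimal sketch of a probing comparison: frozen gLM embeddings vs. one-hot
# sequences, each feeding an identical lightweight regression head.
# Synthetic arrays stand in for real embeddings and measured labels.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_seqs, seq_len, embed_dim = 2000, 200, 512

# Hypothetical inputs (in practice: mean-pooled token embeddings from a
# frozen gLM, flattened one-hot sequences, and functional genomics readouts).
X_glm = rng.normal(size=(n_seqs, embed_dim))
X_onehot = rng.integers(0, 2, size=(n_seqs, seq_len * 4)).astype(float)
y = rng.normal(size=n_seqs)

for name, X in [("gLM embeddings", X_glm), ("one-hot", X_onehot)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)  # same probe for both inputs
    print(f"{name}: test R^2 = {r2_score(y_te, probe.predict(X_te)):.3f}")
```

Because the probe is held fixed across representations, any performance difference is attributable to the features themselves; the paper's finding is that, on real cell-type-specific data, the gLM column of such a comparison offers no substantial advantage.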