Evaluating the representational power of pre-trained DNA language models for regulatory genomics.

Authors

Tang Ziqi, Somia Nirali, Yu Yiyang, Koo Peter K

Affiliations

Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA.

The Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, USA.

Publication information

bioRxiv. 2024 Sep 25:2024.02.29.582810. doi: 10.1101/2024.02.29.582810.

Abstract

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested only after fine-tuning their weights for each downstream task, whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b335/11455533/7fb1a1b37964/nihpp-2024.02.29.582810v2-f0001.jpg
