Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction.

Affiliations

INRAE, MIAT, 31326 Castanet-Tolosan, France.

University of Toulouse, UPS, 31062 Toulouse, France.

Publication information

Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae560.

DOI:10.1093/bib/bbae560

PMID:39489607

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11531863/

Abstract

Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, the amount of which is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) approach based on pseudo-labeling, which makes it possible to exploit unlabeled DNA sequences from numerous genomes during model pre-training. We further improved it by incorporating principles from the Noisy Student algorithm to predict the confidence in the pseudo-labeled data used for pre-training, which showed improvements for transcription factors with very few binding sites (very small training data). The approach is very flexible and can be used to train any neural architecture, including state-of-the-art models, and in most cases shows strong predictive performance improvements compared to standard supervised learning. Moreover, small models trained by SSL showed similar or better performance than the large language model DNABERT2.
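
To make the pre-train/fine-tune idea concrete, here is a minimal sketch of generic pseudo-labeling with confidence weighting. It is not the authors' implementation: the abstract does not say how pseudo-labels are assigned or which architecture is used, so this sketch assumes a teacher model that labels the unlabeled sequences, a plain logistic regression in place of a neural network, and toy sequences whose first four bases decide the label. All function names (`one_hot`, `fit`, `pseudo_label`, `demo`) and the 0.8 threshold are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat one-hot vector (order A, C, G, T)."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x.ravel()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, sample_weight=None, init=None, epochs=300, lr=0.5):
    """Weighted logistic regression by gradient descent.
    `init` warm-starts from an earlier model (used for fine-tuning)."""
    n, d = X.shape
    w, b = (np.zeros(d), 0.0) if init is None else (init[0].copy(), float(init[1]))
    sw = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        g = sw * (p - y)                 # per-sample weighted gradient of the log loss
        w -= lr * (X.T @ g) / sw.sum()
        b -= lr * g.sum() / sw.sum()
    return w, b

def predict_proba(model, X):
    w, b = model
    return sigmoid(X @ w + b)

def pseudo_label(teacher, X_unlabeled, threshold=0.8):
    """Label unlabeled data with the teacher and keep confident predictions only."""
    p = predict_proba(teacher, X_unlabeled)
    conf = np.maximum(p, 1.0 - p)        # confidence in the predicted class
    keep = conf >= threshold
    return X_unlabeled[keep], (p[keep] >= 0.5).astype(float), conf[keep]

def make_set(rng, n):
    """Toy data (assumption): positives start with GATA, negatives with CCCC."""
    seqs, labels = [], []
    for i in range(n):
        positive = i % 2 == 0
        tail = "".join(rng.choice(list(BASES), size=4))
        seqs.append(("GATA" if positive else "CCCC") + tail)
        labels.append(1.0 if positive else 0.0)
    return np.stack([one_hot(s) for s in seqs]), np.array(labels)

def demo(seed=0):
    rng = np.random.default_rng(seed)
    X_lab, y_lab = make_set(rng, 40)     # small labeled set
    X_unl, _ = make_set(rng, 200)        # larger set, labels discarded
    teacher = fit(X_lab, y_lab)
    X_pl, y_pl, conf = pseudo_label(teacher, X_unl)
    # Pre-train the student on pseudo-labels, weighted by teacher confidence.
    student = fit(X_pl, y_pl, sample_weight=conf)
    # Fine-tune the student on the real labels.
    student = fit(X_lab, y_lab, init=student, epochs=100)
    return student, len(y_pl)

if __name__ == "__main__":
    student, n_pseudo = demo()
    print("pseudo-labeled sequences kept:", n_pseudo)
    print("P(bound | GATAACGT):", predict_proba(student, one_hot("GATAACGT")[None, :])[0])
```

Weighting the pre-training loss by teacher confidence is only one simple way to account for noisy pseudo-labels; the abstract's Noisy Student-inspired variant may differ in its details.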
