Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction.

Affiliations

INRAE, MIAT, 31326 Castanet-Tolosan, France.

University of Toulouse, UPS, 31062 Toulouse, France.

Publication information

Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae560.

DOI: 10.1093/bib/bbae560
PMID: 39489607
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11531863/
Abstract

Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, the amount of which is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) method based on pseudo-labeling, which makes it possible to exploit unlabeled DNA sequences from numerous genomes during model pre-training. We further improved it by incorporating principles from the Noisy Student algorithm to predict the confidence in pseudo-labeled data used for pre-training, which yielded improvements for transcription factors with very few binding sites (very small training data). The approach is very flexible and can be used to train any neural architecture, including state-of-the-art models, and in most cases shows strong predictive performance improvements compared to standard supervised learning. Moreover, small models trained by SSL showed similar or better performance than the large language model DNABERT2.
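
The pre-train-then-fine-tune idea in the abstract can be illustrated with a minimal, generic pseudo-labeling sketch. This is not the authors' pipeline: the toy motifs (`GATA` for bound, `CCGC` for unbound), the logistic-regression model, the helper names (`one_hot`, `train`, `pseudo_label`, `make_seq`) and the fixed 0.8 confidence threshold are all assumptions made for illustration. In particular, the fixed threshold is only a simple stand-in for the Noisy Student-inspired confidence prediction mentioned above.

```python
import math
import random

BASES = "ACGT"
SEQ_LEN = 8
N_FEATURES = 4 * SEQ_LEN


def one_hot(seq):
    """Flatten a DNA string into a one-hot vector (4 values per base)."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]


def predict(weights, x):
    """Logistic-regression probability that the sequence is bound."""
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(data, weights=None, epochs=50, lr=0.1):
    """Run SGD on (sequence, label) pairs, optionally from given weights."""
    w = list(weights) if weights is not None else [0.0] * (N_FEATURES + 1)
    for _ in range(epochs):
        for seq, y in data:
            x = one_hot(seq)
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi
            w[-1] -= lr * err  # last weight is the bias term
    return w


def pseudo_label(teacher, unlabeled, threshold=0.8):
    """Keep unlabeled sequences the teacher is confident about (assumed rule)."""
    kept = []
    for seq in unlabeled:
        p = predict(teacher, one_hot(seq))
        if max(p, 1.0 - p) >= threshold:
            kept.append((seq, 1.0 if p >= 0.5 else 0.0))
    return kept


def make_seq(rng, bound):
    """Toy data: random flanks around a planted motif (hypothetical)."""
    seq = [rng.choice(BASES) for _ in range(SEQ_LEN)]
    seq[2:6] = "GATA" if bound else "CCGC"
    return "".join(seq)


rng = random.Random(0)
# Small labeled set (stands in for the single labeled genome).
labeled = [(make_seq(rng, b), float(b)) for b in (True, False) * 3]
# Large unlabeled set (stands in for sequences from other genomes).
unlabeled = [make_seq(rng, i % 2 == 0) for i in range(200)]

teacher = train(labeled)                      # 1. supervised teacher
pseudo = pseudo_label(teacher, unlabeled)     # 2. confident pseudo-labels
student = train(pseudo, epochs=5)             # 3. pre-train on pseudo-labels
student = train(labeled, weights=student, epochs=20)  # 4. fine-tune on labels

print(len(pseudo), "of", len(unlabeled), "unlabeled sequences kept")
```

The same four steps apply to any model that exposes a train and a predict step, which is in line with the abstract's statement that the approach can be used with any neural architecture; the linear model here is chosen only to keep the sketch dependency-free.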


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a730/11531863/60dea17b7268/bbae560f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a730/11531863/903247b2ac05/bbae560f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a730/11531863/cfc6c997f735/bbae560f2.jpg
