Department of Statistics, LMU Munich, Munich, Germany.
Munich Center for Machine Learning, Munich, Germany.
Commun Biol. 2023 Sep 11;6(1):928. doi: 10.1038/s42003-023-05310-2.
Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models when labeled data are scarce. Although many self-supervised learning methods have been proposed, they fail to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times less labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.
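To make the core idea concrete, the sketch below illustrates one way the two ingredients named in the abstract could fit together: using the reverse-complement strand as a second view of the same sequence, and training an encoder so that a context representation predicts the representation of the subsequence that follows it, for several target lengths. This is a minimal toy example, not the authors' implementation; the encoder architecture, the cosine-alignment loss, and all names (`SeqEncoder`, `reverse_complement`, `target_lengths`, etc.) are illustrative assumptions, and the paper's actual objective and network may differ.

```python
# Hedged sketch of a reverse-complement-aware, multi-length self-supervised
# objective for DNA sequences. Illustrative only; not the Self-GenomeNet code.
import torch
import torch.nn as nn
import torch.nn.functional as F

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse-complement strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def one_hot(seq: str) -> torch.Tensor:
    """One-hot encode a DNA string as a (length, 4) float tensor."""
    idx = torch.tensor([BASES.index(b) for b in seq])
    return F.one_hot(idx, num_classes=4).float()

class SeqEncoder(nn.Module):
    """Toy encoder: 1D convolution over one-hot bases, summarized by a GRU."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(4, dim, kernel_size=7, padding=3)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, 4) -> (batch, dim)
        h = F.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        _, hidden = self.gru(h)
        return hidden[-1]

def self_supervised_loss(seq: str, encoder: SeqEncoder,
                         context_len: int = 100,
                         target_lengths=(25, 50, 100)) -> torch.Tensor:
    """The context representation (averaged over both strands) should match
    the representation of the subsequence that follows it, for several
    target lengths (short- and long-term dependencies)."""
    losses = []
    for t_len in target_lengths:
        context = seq[:context_len]
        target = seq[context_len:context_len + t_len]
        # Encode the context on both strands and average, exploiting the fact
        # that forward and reverse-complement strands encode the same content.
        ctx = torch.stack([
            encoder(one_hot(context).unsqueeze(0)),
            encoder(one_hot(reverse_complement(context)).unsqueeze(0)),
        ]).mean(dim=0)
        tgt = encoder(one_hot(target).unsqueeze(0))
        # Simple cosine-alignment loss; the paper's objective may differ.
        losses.append(1 - F.cosine_similarity(ctx, tgt).mean())
    return torch.stack(losses).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    enc = SeqEncoder()
    demo_seq = "ACGT" * 60  # 240 bp toy sequence
    loss = self_supervised_loss(demo_seq, enc)
    loss.backward()  # gradients flow into the encoder; plug into any optimizer
    print(f"toy loss: {loss.item():.4f}")
```

The varying `target_lengths` stand in for the abstract's "targets of different lengths": short targets emphasize local motifs, long targets force the context representation to capture longer-range structure.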