
GAN-based data augmentation for transcriptomics: survey and comparative assessment.

Affiliations

IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France.

TAU, CNRS-INRIA-LISN, University Paris-Saclay, Gif-sur-Yvette 91190, France.

Publication information

Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i111-i120. doi: 10.1093/bioinformatics/btad239.

DOI:10.1093/bioinformatics/btad239
PMID:37387181
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10311334/
Abstract

MOTIVATION

Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting the full predictive power of deep learning models for phenotype prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes.

RESULTS

This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields accuracies of 94% and 70% for binary and tissue classification, respectively. In comparison, we achieved 98% and 94% accuracy when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN return better augmentation performance and better generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly.
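The evaluation protocol described above (train on a small real set, add generated samples, test on held-out real data) can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' method: the "real" expression profiles are toy Gaussian data, the GAN is replaced by a much simpler class-conditional Gaussian sampler, and the classifier is a nearest-centroid rule. All function names are hypothetical, and the sketch makes no attempt to reproduce the reported accuracies.

```python
import random
import statistics


def make_toy_profiles(n_per_class, n_genes, shift, rng):
    """Toy stand-in for labelled expression profiles (two classes)."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            profile = [rng.gauss(label * shift, 1.0) for _ in range(n_genes)]
            data.append((profile, label))
    return data


def fit_generator(samples):
    """Per-class, per-gene mean and std. ASSUMPTION: replaces a trained GAN."""
    params = {}
    for label in sorted({y for _, y in samples}):
        rows = [x for x, y in samples if y == label]
        params[label] = [
            (statistics.fmean(col), statistics.pstdev(col)) for col in zip(*rows)
        ]
    return params


def generate(params, n_samples, rng):
    """Draw labelled synthetic profiles, cycling through the classes."""
    labels = sorted(params)
    out = []
    for i in range(n_samples):
        label = labels[i % len(labels)]
        out.append(([rng.gauss(mu, sd) for mu, sd in params[label]], label))
    return out


def fit_centroids(samples):
    """Nearest-centroid classifier: one mean profile per class."""
    centroids = {}
    for label in sorted({y for _, y in samples}):
        rows = [x for x, y in samples if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, profile):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(profile, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, samples):
    hits = sum(predict(centroids, x) == y for x, y in samples)
    return hits / len(samples)


rng = random.Random(0)
real_train = make_toy_profiles(25, 20, 1.0, rng)   # 50 "real" training samples
real_test = make_toy_profiles(100, 20, 1.0, rng)   # held-out "real" test set

generator = fit_generator(real_train)
augmented_train = real_train + generate(generator, 1000, rng)  # +1000 synthetic

baseline_acc = accuracy(fit_centroids(real_train), real_test)
augmented_acc = accuracy(fit_centroids(augmented_train), real_test)
print(f"baseline: {baseline_acc:.3f}  augmented: {augmented_acc:.3f}")
```

The key design point is that the generator is fitted only on the training split and accuracy is always measured on real held-out samples, so any gain reflects the augmentation rather than leakage. With this simple sampler and classifier, little or no gain should be expected.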

AVAILABILITY AND IMPLEMENTATION

All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics.
