IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France.
TAU, CNRS-INRIA-LISN, University Paris-Saclay, Gif-sur-Yvette 91190, France.
Bioinformatics. 2023 Jun 30;39(Suppl 1):i111-i120. doi: 10.1093/bioinformatics/btad239.
Transcriptomics data are becoming more accessible thanks to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting the full predictive power of deep learning models for phenotype prediction. Artificially enlarging the training set, namely data augmentation, has been suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes.
This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields accuracies of 94% and 70% for binary and tissue classification, respectively. In comparison, we achieved 98% and 94% accuracy when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and better generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly.
All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics.
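As an illustration of the augmentation setting described above (a small real training set enlarged with generated samples), the following is a minimal sketch and not the authors' implementation. The `generate(labels, rng)` interface, the function names, and the toy data are assumptions introduced here. The placeholder generator simply samples around class means and is not a GAN; in practice it would be replaced by a trained conditional generator.

```python
import numpy as np


def augment_training_set(x_real, y_real, generate, n_augmented, rng):
    """Append n_augmented synthetic samples to a small real training set.

    `generate(labels, rng)` is a hypothetical interface standing in for a
    trained conditional generator: it returns one synthetic expression
    profile per requested label.
    """
    # Draw labels for the synthetic samples from the empirical label distribution.
    y_fake = rng.choice(y_real, size=n_augmented, replace=True)
    x_fake = generate(y_fake, rng)
    # Real samples come first, synthetic samples are appended after them.
    return np.vstack([x_real, x_fake]), np.concatenate([y_real, y_fake])


def make_placeholder_generator(x_real, y_real):
    """Placeholder for a trained GAN generator (NOT a GAN).

    Samples Gaussian noise around each class mean, only so that the
    sketch runs end to end.
    """
    classes = np.unique(y_real)
    means = {c: x_real[y_real == c].mean(axis=0) for c in classes}
    scale = x_real.std(axis=0) + 1e-8

    def generate(labels, rng):
        centers = np.stack([means[c] for c in labels])
        return centers + 0.1 * scale * rng.standard_normal(centers.shape)

    return generate


rng = np.random.default_rng(0)
x_real = rng.normal(size=(50, 200))  # toy data: 50 samples, 200 "genes"
y_real = np.tile([0, 1], 25)         # toy binary phenotype labels
generate = make_placeholder_generator(x_real, y_real)
x_train, y_train = augment_training_set(x_real, y_real, generate, 1000, rng)
print(x_train.shape, y_train.shape)
```

The augmented arrays can then be passed to any classifier in place of the original 50 samples; evaluation should still be done on held-out real data only.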