Institute of Cell Biology and Immunology, University of Stuttgart, Stuttgart, Germany.
ProKanDo GmbH, Ludwigsburg, Germany.
PLoS Comput Biol. 2023 Apr 3;19(4):e1011035. doi: 10.1371/journal.pcbi.1011035. eCollection 2023 Apr.
Established prognostic tests based on limited numbers of transcripts can identify high-risk breast cancer patients, yet they are approved only for individuals presenting with specific clinical features or disease characteristics. Deep learning algorithms hold potential for stratifying patient cohorts based on full transcriptome data, yet the development of robust classifiers is hampered because the number of variables in omics datasets typically far exceeds the number of patients. To overcome this hurdle, we propose a classifier based on a data augmentation pipeline consisting of a Wasserstein generative adversarial network (GAN) with gradient penalty and an embedded auxiliary classifier to obtain a trained GAN discriminator (T-GAN-D). Applied to 1244 patients of the METABRIC breast cancer cohort, this classifier outperformed established breast cancer biomarkers in separating low- from high-risk patients (disease-specific death, progression or relapse within 10 years from initial diagnosis). Importantly, the T-GAN-D also performed well across independent, merged transcriptome datasets (METABRIC and TCGA-BRCA cohorts), and merging data improved overall patient stratification. In conclusion, the reiterative GAN-based training process made it possible to generate a robust classifier capable of stratifying low- vs high-risk patients based on full transcriptome data and across independent and heterogeneous breast cancer cohorts.
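The abstract names a Wasserstein GAN with gradient penalty but gives no architectural detail, so the following is only a minimal sketch of how the critic (discriminator) loss with a gradient penalty is commonly computed, not the authors' implementation. The one-unit `tanh` critic, the function names, the toy five-feature data, and the penalty weight `lam=10.0` (a widely used default) are all assumptions made for illustration; the auxiliary classifier is omitted.

```python
import math
import random

def critic(x, w, b, v):
    """Toy one-unit critic D(x) = v * tanh(w.x + b). Hypothetical, not the paper's network."""
    return v * math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)

def critic_grad(x, w, b, v):
    """Analytic gradient of the toy critic with respect to its input x."""
    t = math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [v * (1.0 - t * t) * wi for wi in w]

def gradient_penalty(real, fake, w, b, v, rng):
    """Mean of (||grad D(x_hat)||_2 - 1)^2 over random real/fake interpolates."""
    total = 0.0
    for xr, xf in zip(real, fake):
        eps = rng.random()  # interpolation coefficient in [0, 1)
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(xr, xf)]
        norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w, b, v)))
        total += (norm - 1.0) ** 2
    return total / len(real)

def critic_loss(real, fake, w, b, v, rng, lam=10.0):
    """Critic loss: E[D(fake)] - E[D(real)] + lam * gradient penalty (lam is an assumed default)."""
    d_real = sum(critic(x, w, b, v) for x in real) / len(real)
    d_fake = sum(critic(x, w, b, v) for x in fake) / len(fake)
    return d_fake - d_real + lam * gradient_penalty(real, fake, w, b, v, rng)

# Usage with made-up data: 8 "real" and 8 "generated" samples of 5 features each.
rng = random.Random(0)
real = [[rng.gauss(1.0, 1.0) for _ in range(5)] for _ in range(8)]
fake = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(8)]
w = [0.3, -0.2, 0.5, 0.1, -0.4]
print(critic_loss(real, fake, w, b=0.0, v=1.5, rng=rng))
```

In a real pipeline the critic would be a deep network over the full transcriptome and the input gradient would come from automatic differentiation; the analytic gradient here only keeps the sketch dependency-free.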