Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia.
Department of Medical Biology, The University of Melbourne, Melbourne, Victoria, Australia.
Nat Biotechnol. 2023 Jan;41(1):82-95. doi: 10.1038/s41587-022-01440-w. Epub 2022 Sep 15.
Accurate identification and effective removal of unwanted variation are essential for deriving meaningful biological results from RNA sequencing (RNA-seq) data, especially when the data come from large and complex studies. Using RNA-seq data from The Cancer Genome Atlas (TCGA), we examine several sources of unwanted variation and demonstrate how they can significantly compromise downstream analyses, including cancer subtype identification, association between gene expression and survival outcomes, and gene co-expression analysis. We propose a strategy, called pseudo-replicates of pseudo-samples (PRPS), for deploying our recently developed normalization method, removing unwanted variation III (RUV-III), to remove the variation caused by library size, tumor purity and batch effects in TCGA RNA-seq data. We illustrate the value of our approach by comparing it with the standard TCGA normalizations on several TCGA RNA-seq datasets. RUV-III with PRPS can also be used to integrate and normalize other large transcriptomic datasets that come from multiple laboratories or platforms.