Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA.
Genome Biol. 2022 Jun 27;23(1):137. doi: 10.1186/s13059-022-02705-y.
In studies of cellular function in cancer, researchers are increasingly able to choose from many -omics assays as functional readouts. Choosing the correct readout for a given study can be difficult, and which layer of cellular function is most suitable to capture the relevant signal remains unclear.
We consider prediction of cancer mutation status (presence or absence) from functional -omics data as a representative problem, one that offers an opportunity to quantify and compare the ability of different -omics readouts to capture signals of dysregulation in cancer. Using the TCGA Pan-Cancer Atlas, which contains genetic alteration data, we focus on RNA sequencing, DNA methylation arrays, reverse phase protein arrays (RPPA), microRNA, and somatic mutational signatures as -omics readouts. Across a collection of genes recurrently mutated in cancer, RNA sequencing tends to be the most effective predictor of mutation state. For many of these genes, however, we find that one or more other data types are approximately equally effective predictors. Performance varies more between mutations than between data types for the same mutation, and there is little difference between the top data types. We also find that combining data types into a single multi-omics model provides little or no improvement in predictive ability over the best individual data type.
Our results suggest that, for the design of studies focused on the functional outcomes of cancer mutations, multiple -omics types can often serve as effective readouts, although gene expression seems to be a reasonable default option.
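As a minimal, hedged sketch of the kind of comparison described above, the following Python example scores several synthetic stand-ins for -omics readouts by how well each predicts a binary mutation label on held-out samples. Everything in it is an assumption made for illustration: the data are simulated rather than drawn from TCGA, the function names and readout names are hypothetical, AUROC is chosen here only as a convenient metric, and the difference-of-class-means scorer is a deliberately simple stand-in, not the model used in the study.

```python
import random


def make_synthetic_readout(labels, n_features, signal, rng):
    """Simulate one readout: the first 5 features shift by `signal` in mutated samples."""
    rows = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        if y == 1:
            for j in range(min(5, n_features)):
                row[j] += signal
        rows.append(row)
    return rows


def column_means(rows):
    """Per-feature mean over a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_centroid_direction(X, y):
    """Stand-in linear scorer: mutated-class mean minus non-mutated-class mean."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return [a - b for a, b in zip(column_means(pos), column_means(neg))]


def auroc(scores, labels):
    """Probability that a mutated sample outscores a non-mutated one (ties count half)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auroc(X, y, n_folds=4):
    """Mean held-out AUROC, with sample i assigned to fold i % n_folds."""
    fold_scores = []
    for k in range(n_folds):
        test_idx = [i for i in range(len(y)) if i % n_folds == k]
        train_idx = [i for i in range(len(y)) if i % n_folds != k]
        w = fit_centroid_direction([X[i] for i in train_idx],
                                   [y[i] for i in train_idx])
        scores = [sum(wj * xj for wj, xj in zip(w, X[i])) for i in test_idx]
        fold_scores.append(auroc(scores, [y[i] for i in test_idx]))
    return sum(fold_scores) / n_folds


rng = random.Random(0)

# 240 simulated samples; every third one is "mutated", so each of the
# 4 folds (i % 4) contains both classes in the same proportion.
labels = [1 if i % 3 == 0 else 0 for i in range(240)]

# Hypothetical readouts that differ only in how strongly they reflect the mutation.
simulated_signal = {
    "readout_strong": 1.0,
    "readout_moderate": 0.8,
    "readout_null": 0.0,
}

results = {}
for name, signal in simulated_signal.items():
    X = make_synthetic_readout(labels, n_features=50, signal=signal, rng=rng)
    results[name] = cross_validated_auroc(X, labels)
    print(f"{name}: mean held-out AUROC = {results[name]:.3f}")
```

Evaluating every readout on the same samples and the same folds is what makes the per-readout scores comparable. In a real analysis one would likely swap in a regularized classifier and repeat the procedure per gene, which is how differences between mutations and differences between data types could be examined separately.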