Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA.
Department of Medicine, Institute for Clinical and Translational Research, Baylor College of Medicine, Houston, TX, USA.
BMC Cancer. 2021 Sep 25;21(1):1053. doi: 10.1186/s12885-021-08796-3.
Over the past decades, approaches for diagnosing and treating cancer have improved significantly. However, the variability of patient and tumor characteristics has limited progress on methods for prognosis prediction. The development of high-throughput omics technologies now provides multiple approaches for characterizing tumors. Although many published studies have focused on integrating multi-omics data and using pathway-level models for cancer prognosis prediction, a knowledge gap remains regarding the prognostic landscape across multi-omics data for multiple cancer types using both gene-level and pathway-level predictors.
In this study, we systematically evaluated three commonly available types of omics data (gene expression, copy number variation and somatic point mutation) covering both DNA-level and RNA-level features. We evaluated the landscape of predictive performance of these three omics modalities for 33 cancer types in the TCGA using a Lasso-penalized or Group Lasso-penalized Cox model with either gene-level or pathway-level predictors.
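As a rough illustration of what a Lasso-penalized Cox model involves, the sketch below fits one by proximal gradient descent on the Cox partial likelihood. It is not the authors' implementation: the function names (`cox_neg_loglik_grad`, `soft_threshold`, `fit_lasso_cox`), the step size, the penalty weight and the synthetic data are all assumptions made for this example, and tied event times get no special handling.

```python
import numpy as np


def cox_neg_loglik_grad(beta, X, time, event):
    """Gradient of the negative Cox partial log-likelihood, divided by n."""
    eta = X @ beta
    eta = eta - eta.max()  # shift for numerical stability; the gradient is unchanged
    w = np.exp(eta)
    # at_risk[i, j] = 1 when subject j is still under observation at time[i]
    at_risk = (time[None, :] >= time[:, None]).astype(float)
    denom = at_risk @ w
    # Risk-set weighted mean of the covariates at each subject's time
    weighted_mean = (at_risk @ (w[:, None] * X)) / denom[:, None]
    # Only subjects with an observed event contribute a term
    return -((X - weighted_mean) * event[:, None]).sum(axis=0) / len(time)


def soft_threshold(z, thresh):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)


def fit_lasso_cox(X, time, event, lam=0.1, step=0.1, n_iter=500):
    """Minimise (negative partial log-likelihood / n) + lam * ||beta||_1."""
    event = np.asarray(event, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = cox_neg_loglik_grad(beta, X, time, event)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta


if __name__ == "__main__":
    # Synthetic cohort (an assumption for illustration): only feature 0 affects hazard
    rng = np.random.default_rng(0)
    n, p = 300, 5
    X = rng.standard_normal((n, p))
    true_beta = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
    event_time = rng.exponential(1.0 / np.exp(X @ true_beta))
    censor_time = rng.exponential(2.0, size=n)
    time = np.minimum(event_time, censor_time)
    event = event_time <= censor_time
    print(fit_lasso_cox(X, time, event))
```

The L1 penalty shrinks coefficients toward zero and can set some exactly to zero, which is why such models select a subset of predictors. A Group Lasso penalty would instead shrink predefined groups of coefficients (for example, genes in a pathway) together; that variant is not shown here.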
We constructed the prognostic landscape using three types of omics data for 33 cancer types at both the gene and pathway levels. Based on this landscape, we found that predictive performance is cancer type dependent, and we highlighted the cancer types and omics modalities that support the most accurate prognostic models. In general, models estimated on gene expression data provide the best predictive performance at either the gene or pathway level, and adding copy number variation or somatic point mutation data to gene expression data does not improve predictive performance, with exceptions in some cohorts, including low-grade glioma and thyroid cancer. Pathway-level models generally have better interpretative performance, higher stability and smaller model size than gene-level models across multiple cancer types and omics data types.
Based on this landscape and a comprehensive comparison, models estimated on gene expression data provide the best predictive performance at either the gene or pathway level. Pathway-level models have better interpretative performance, higher stability and smaller model size relative to gene-level models.