Ray Bisakha, Liu Wenke, Fenyö David
Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA.
Cancer Inform. 2017 Aug 18;16:1176935117725727. doi: 10.1177/1176935117725727. eCollection 2017.
The amounts and types of available multimodal tumor data are rapidly increasing, and their integration is critical for fully understanding the underlying cancer biology and personalizing treatment. However, the development of methods for effectively integrating multimodal data in a principled manner is lagging behind our ability to generate the data. In this article, we introduce an extension to a multiview nonnegative matrix factorization (NNMF) algorithm for dimensionality reduction and integration of heterogeneous data types and compare the predictive modeling performance of the method on unimodal and multimodal data. We also present a comparative evaluation of our novel multiview approach and current data integration methods. Our work provides an efficient method to extend an existing dimensionality reduction method. We report rigorous evaluation of the method on large-scale quantitative protein and phosphoprotein tumor data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) acquired using state-of-the-art liquid chromatography mass spectrometry. Exome sequencing and RNA-Seq data were also available from The Cancer Genome Atlas for the same tumors. For unimodal data in breast cancer, transcript levels were most predictive of estrogen and progesterone receptor status, and copy number variation was most predictive of human epidermal growth factor receptor 2 status. For ovarian and colon cancers, phosphoprotein and protein levels were most predictive of tumor grade and stage and residual tumor, respectively. When multiview NNMF was applied to multimodal data to predict outcomes, the improvement in performance over unimodal data was not statistically significant overall, suggesting that proteomics data may contain more predictive information regarding tumor phenotypes than transcript levels, probably because proteins are the functional gene products and therefore a more direct measurement of the functional state of the tumor.
Here, we have applied our proposed approach to multimodal molecular data for tumors, but it is generally applicable to dimensionality reduction and joint analysis of any type of multimodal data.
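To make the general idea of a shared low-dimensional representation across data types concrete, the following is a minimal sketch of a generic joint NNMF, in which each nonnegative data matrix (samples by features) is approximated as a shared sample factor times a view-specific feature factor, fitted with standard multiplicative updates. It is not the extension introduced in this article; the function names `joint_nmf` and `reconstruction_error`, the rank, and the iteration count are hypothetical choices made only for illustration.

```python
import numpy as np


def joint_nmf(views, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate each nonnegative view X_v (samples x features_v) as W @ H_v.

    W (samples x rank) is shared across views; each H_v is view-specific.
    Illustrative sketch only, not the method evaluated in the article.
    """
    rng = np.random.default_rng(seed)
    n_samples = views[0].shape[0]
    W = rng.random((n_samples, rank))
    Hs = [rng.random((rank, X.shape[1])) for X in views]

    for _ in range(n_iter):
        # Update each view-specific factor with the shared W held fixed.
        for i, X in enumerate(views):
            Hs[i] *= (W.T @ X) / (W.T @ W @ Hs[i] + eps)
        # Update the shared factor using the evidence from all views.
        numerator = sum(X @ H.T for X, H in zip(views, Hs))
        denominator = sum(W @ (H @ H.T) for H in Hs) + eps
        W *= numerator / denominator
    return W, Hs


def reconstruction_error(views, W, Hs):
    """Sum of squared Frobenius reconstruction errors over all views."""
    return sum(np.linalg.norm(X - W @ H) ** 2 for X, H in zip(views, Hs))
```

In a sketch like this, the rows of `W` would serve as the reduced features passed to a downstream predictive model, and fitting on a single-element `views` list gives the unimodal counterpart for comparison.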