Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, 61801, IL, USA.
IBM Research Almaden, San Jose, 95120, CA, USA.
Artif Intell Med. 2024 Mar;149:102787. doi: 10.1016/j.artmed.2024.102787. Epub 2024 Jan 26.
Traditional approaches to predicting survival outcomes for breast cancer patients have relied on clinical subgroups, the PAM50 genes, or histological evaluation of tissue. With the growth of multi-modality datasets that capture diverse information about the same cancer (such as genomics, histology, radiology and clinical data), this information can be integrated using advanced tools to improve survival prediction. These methods implicitly exploit the key observation that different modalities originate from the same cancer source and jointly provide a complete picture of the cancer. In this work, we investigate the benefits of explicitly modelling multi-modality data as originating from the same cancer under a probabilistic framework. Specifically, we consider histology and genomics as two modalities originating from the same breast cancer under a probabilistic graphical model (PGM). We construct maximum likelihood estimates of the PGM parameters based on canonical correlation analysis (CCA) and then infer the underlying properties of the cancer patient, such as survival. Equivalently, we construct CCA-based joint embeddings of the two modalities and input them to a learnable predictor. Penalized variants of CCA (pCCA) capture real-world properties such as sparsity and graph structure, and are therefore better suited to cancer applications. To generate richer multi-dimensional embeddings with pCCA, we introduce two novel embedding schemes that encourage orthogonality. The efficacy of the proposed prediction pipeline is first demonstrated on simulated data, through low prediction errors for the hidden variable and the generation of informative embeddings. When applied to breast cancer histology and RNA-sequencing expression data from The Cancer Genome Atlas (TCGA), our model provides survival predictions with average concordance indices of up to 68.32%, along with interpretability. We also illustrate how the pCCA embeddings can be used for survival analysis through Kaplan-Meier curves.
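To make the idea of CCA-based joint embeddings concrete, the following is a minimal sketch of classical (unpenalized) CCA on simulated two-modality data driven by a shared hidden variable. It is not the pCCA method or the embedding schemes described in the abstract. The function name `cca_embed`, the ridge term `reg`, and the simulated dimensions are all assumptions chosen for illustration.

```python
import numpy as np


def cca_embed(X, Y, k=2, reg=1e-3):
    """Classical CCA via whitening and SVD (illustrative, unpenalized).

    Returns the concatenated k-dimensional projections of both
    modalities and the top-k (ridge-regularized) canonical correlations.
    `reg` is an assumed ridge term added only for numerical stability.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Sample covariance and cross-covariance matrices.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations; singular vectors give the directions.
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :k]      # projection for modality X
    B = Wy @ Vt.T[:, :k]   # projection for modality Y

    # Joint embedding: concatenate the two projected views.
    Z = np.hstack([Xc @ A, Yc @ B])
    return Z, s[:k]


# Simulated data: one hidden variable z generates both modalities.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=(n, 1))
X = z @ rng.normal(size=(1, 10)) + 0.5 * rng.normal(size=(n, 10))
Y = z @ rng.normal(size=(1, 8)) + 0.5 * rng.normal(size=(n, 8))

Z, corr = cca_embed(X, Y, k=2)
print(Z.shape, corr)
```

In a pipeline like the one outlined above, the rows of `Z` would be passed to a downstream predictor (for example, a survival model). Here the first embedding coordinate should track the simulated hidden variable `z` closely, up to sign.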