Department of Biomedical Engineering, Rutgers University, Piscataway, New Jersey, USA.
BMC Bioinformatics. 2011 Dec 19;12:483. doi: 10.1186/1471-2105-12-483.
Multimodal data, especially imaging and non-imaging data, is routinely acquired in the context of disease diagnostics; however, computational challenges have limited the ability to quantitatively integrate imaging and non-imaging data channels with different dimensionalities and scales. To the best of our knowledge, relatively few attempts have been made to quantitatively fuse such data to construct classifiers, and none have attempted to quantitatively combine histology (imaging) and proteomic (non-imaging) measurements for making diagnostic and prognostic predictions. The objective of this work is to create a common subspace, called a metaspace, that simultaneously accommodates both the imaging and non-imaging data (and hence data corresponding to different scales and dimensionalities). This metaspace can be used to build a meta-classifier that produces better classification results than a classifier based on a single modality alone. Canonical Correlation Analysis (CCA) and Regularized CCA (RCCA) are statistical techniques that extract correlations between two modes of data to construct a homogeneous, uniform representation of heterogeneous data channels. In this paper, we present a novel modification to CCA and RCCA, Supervised Regularized Canonical Correlation Analysis (SRCCA), that (1) enables the quantitative integration of data from multiple modalities using a feature selection scheme, (2) is regularized, and (3) is computationally cheap. We apply this SRCCA framework to the fusion of proteomic and histologic image signatures for identifying prostate cancer patients at risk of biochemical recurrence within 5 years of radical prostatectomy.
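As a rough illustration of the RCCA step that SRCCA builds on, the sketch below computes a regularized CCA projection with NumPy and concatenates the two projected views into a fused metaspace. It is not the authors' SRCCA: the supervised, feature-selection-driven choice of regularization is not reproduced here, and the function names (`inv_sqrt`, `rcca`), the fixed values of `lam_x` and `lam_y`, the feature counts, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def inv_sqrt(c):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(c)
    return (v / np.sqrt(w)) @ v.T

def rcca(x, y, lam_x=0.1, lam_y=0.1, k=2):
    """Plain regularized CCA (hypothetical helper, not SRCCA).

    Returns projection matrices for each view and the top-k
    regularized canonical correlations.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    # Ridge terms keep the covariances invertible when features outnumber samples.
    cxx = x.T @ x / (n - 1) + lam_x * np.eye(x.shape[1])
    cyy = y.T @ y / (n - 1) + lam_y * np.eye(y.shape[1])
    cxy = x.T @ y / (n - 1)
    kx, ky = inv_sqrt(cxx), inv_sqrt(cyy)
    # SVD of the whitened cross-covariance yields the canonical directions.
    u, s, vt = np.linalg.svd(kx @ cxy @ ky, full_matrices=False)
    return kx @ u[:, :k], ky @ vt.T[:, :k], s[:k]

# Synthetic stand-in data: 19 samples, two views sharing a 2-D latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(19, 2))
x = latent @ rng.normal(size=(2, 50)) + 0.1 * rng.normal(size=(19, 50))  # "imaging" view
y = latent @ rng.normal(size=(2, 30)) + 0.1 * rng.normal(size=(19, 30))  # "non-imaging" view

wx, wy, corr = rcca(x, y)
# Fused metaspace: both views projected onto their canonical directions.
metaspace = np.hstack([(x - x.mean(axis=0)) @ wx, (y - y.mean(axis=0)) @ wy])
```

The rows of `metaspace` could then be passed to any classifier. In practice the regularization strengths would be tuned rather than fixed; according to the abstract, SRCCA's contribution is to make that step supervised and computationally cheap.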
A cohort of 19 grade- and stage-matched prostate cancer patients, all of whom had undergone radical prostatectomy, was considered in this study; 10 had biochemical recurrence within 5 years of surgery and 9 did not. The aim was to construct a lower-dimensional fused metaspace comprising both the histological and proteomic measurements obtained from the site of the dominant nodule on the surgical specimen. In conjunction with SRCCA, a random forest classifier was able to identify the prostate cancer patients who developed biochemical recurrence within 5 years with a maximum classification accuracy of 93%.
Classifier performance in the SRCCA space was found to be statistically significantly higher than in the fused data representations obtained not only from CCA and RCCA, but also from two other statistical techniques, Principal Component Analysis and Partial Least Squares Regression. These results suggest that SRCCA is a computationally efficient and highly accurate scheme for representing multimodal (histologic and proteomic) data in a metaspace, and that it could be used to construct fused biomarkers for predicting disease recurrence and prognosis.