IEEE/ACM Trans Comput Biol Bioinform. 2020 Jul-Aug;17(4):1290-1302. doi: 10.1109/TCBB.2019.2894635. Epub 2019 Jan 23.
Multimodal data integration is an important framework for cancer subtype discovery, as it can blend the inherent properties of individual modalities with their cross-platform correlations to infer clinically relevant subtypes. The main problem is the appropriate selection of relevant and complementary modalities. Another problem is the 'high dimension-low sample size' nature of each modality. This work proposes a novel algorithm to construct a low-rank joint subspace from the low-rank subspaces of individual high-dimensional modalities. Statistical hypothesis testing is introduced to estimate the rank of each modality by separating its signal component from its noise counterpart. Two quantitative indices are proposed to evaluate the quality of different modalities: the first assesses the degree of relevance of the cluster structure embedded within each modality, while the second evaluates the amount of cluster information shared between two modalities. To construct the joint subspace, the algorithm selects the most relevant modalities with maximum shared information. During data integration, the intersection between two subspaces is also considered to select cluster information and filter out the noise from different subspaces. The efficacy of clustering on the joint subspace extracted by the proposed algorithm is compared with that of several existing integrative clustering approaches on real-life multimodal data sets. Experimental results show that the identified subtypes more closely resemble the clinically established subtypes than do the subtypes identified by the existing approaches. Survival analysis reveals significant differences between the survival profiles of the identified subtypes, while robustness analysis shows that the identified subtypes are not sensitive to perturbation of the data sets.
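The idea of building a joint subspace from per-modality low-rank subspaces can be illustrated with a minimal sketch. This is not the published algorithm: the rank is chosen here with a simple cumulative-energy threshold standing in for the statistical hypothesis test, the relevance and shared-information indices are omitted, and the subspace intersection is not used for noise filtering. The function names `low_rank_basis` and `joint_subspace`, the `energy` and `tol` parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def low_rank_basis(X, energy=0.9):
    """Orthonormal basis (n_samples x r) for the dominant left subspace of X.

    X is samples x features. The rank r is the smallest number of singular
    values whose squared sum reaches `energy` of the total; this threshold is
    an assumed stand-in for a hypothesis-test-based rank estimate.
    """
    Xc = X - X.mean(axis=0, keepdims=True)  # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    r = min(int(np.searchsorted(cum, energy)) + 1, len(s))
    return U[:, :r]


def joint_subspace(bases, tol=1e-8):
    """Orthonormal basis for the span of the union of the given subspaces."""
    joint = bases[0]
    for B in bases[1:]:
        # Part of B not already explained by the current joint basis.
        R = B - joint @ (joint.T @ B)
        Q, s, _ = np.linalg.svd(R, full_matrices=False)
        # Keep only directions that add something beyond numerical noise.
        joint = np.hstack([joint, Q[:, s > tol]])
    return joint


# Toy example: two synthetic "modalities" sharing a 2-dimensional latent structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 2))
X1 = latent @ rng.normal(size=(2, 200)) + 0.01 * rng.normal(size=(40, 200))
X2 = latent @ rng.normal(size=(2, 300)) + 0.01 * rng.normal(size=(40, 300))
J = joint_subspace([low_rank_basis(X1), low_rank_basis(X2)])
print(J.shape)
```

In such a sketch, the rows of `J` give each sample's coordinates in the joint subspace, and a standard clustering method could then be applied to them. Both modalities have far more features than samples, which mirrors the 'high dimension-low sample size' setting described in the abstract.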