Weighted dimensionality reduction and robust Gaussian mixture model based cancer patient subtyping from gene expression data.

Affiliation

Machine Learning Lab, Department of Electronics and Communication Engineering, National Institute of Technology, Srinagar, JK, India.

Publication information

J Biomed Inform. 2020 Dec;112:103620. doi: 10.1016/j.jbi.2020.103620. Epub 2020 Nov 11.

DOI:10.1016/j.jbi.2020.103620

PMID:33188907

Abstract

BACKGROUND

The heterogeneous nature of cancer necessitates subtyping of cancer patients into distinct, well-separated subgroups. However, computational issues arise because gene expression data are high dimensional, noisy and contain outliers. As a result, attempts to subtype cancer patients from gene expression data often yield highly overlapping Kaplan-Meier (KM) survival plots, which makes it difficult to distinguish clearly among the discovered subtypes. Here we attempt to achieve greater separation among the subtypes through a robust clustering pipeline.

METHODS

We propose a robust framework to achieve better separation among the discovered subtypes. The framework combines dimensionality reduction of a weighted gene expression matrix using t-distributed Stochastic Neighbor Embedding (t-SNE) with a clustering approach based on a robust Gaussian mixture model. Before dimensionality reduction, every gene is weighted according to its median absolute deviation (MAD). The results are quantified by measuring the minimum pairwise separation among the KM plots and the minimum hazard ratio among the subtypes. We also introduce a novel measure, called cumulative survival separation, to quantify the separation among the discovered subtypes.
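The MAD weighting step can be illustrated with a minimal sketch. The abstract does not say how the MAD score is turned into a weight, so the code below assumes the simplest reading: each gene column is multiplied by its own raw MAD. The function names `mad` and `mad_weight_genes` and the toy matrix are hypothetical, chosen only for illustration; the t-SNE and robust Gaussian mixture steps are not shown.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


def mad_weight_genes(expr):
    """Weight every gene (column) by its MAD across patients (rows).

    ASSUMPTION: the weight is the raw MAD and is applied by
    multiplication; the source does not specify the exact scheme.
    """
    n_genes = len(expr[0])
    weights = [mad([row[g] for row in expr]) for g in range(n_genes)]
    weighted = [[row[g] * weights[g] for g in range(n_genes)] for row in expr]
    return weighted, weights


# Toy matrix: 4 patients x 2 genes. Gene 0 varies, gene 1 is constant.
expr = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [10.0, 5.0]]
weighted, weights = mad_weight_genes(expr)
print(weights)   # gene 1 gets weight 0 because it never varies
print(weighted)
```

Under this assumption, genes that barely vary across patients are shrunk toward zero before dimensionality reduction, and because MAD is based on medians, a single extreme value (the 10.0 above) does not inflate a gene's weight the way a variance-based score would.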

RESULTS

To validate the proposed methodology, we obtained five cancer gene expression datasets from The Cancer Genome Atlas (TCGA). Comparisons with Consensus Clustering (CC), Consensus non-negative matrix factorization (CNMF), fast density-aware spectral clustering (Spectrum) and Neighborhood based Multi-Omics clustering (NEMO) show that the proposed method achieves greater separation than these methods from the literature. For instance, the minimum pairwise life expectancy difference between the discovered subtypes for GBM is 61 days for the proposed methodology with MAD scores, whereas it is only approximately 33, 19, 49 and 33 days for CC, Spectrum, NEMO and CNMF, respectively. Comparisons of the proposed framework with and without MAD scores are also shown, and the MAD scores are observed to significantly improve the subtype separation. Hazard ratio analysis also shows that the proposed methodology performs better. Furthermore, pathway over-representation analyses were carried out to identify relevant genetic pathways that may be possible targets for treatment.
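A minimum pairwise life expectancy difference of this kind can be sketched as follows. The abstract does not state how life expectancy is derived from the KM plots, so this sketch assumes it is the area under each subtype's Kaplan-Meier curve up to a common time horizon (a restricted mean survival time). The function names, the `horizon` parameter and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def km_curve(times, events):
    """Kaplan-Meier estimate as (time, survival) steps at each event time.

    times: follow-up time per patient; events: 1 = death, 0 = censored.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


def restricted_mean(times, events, horizon):
    """Area under the KM step function from 0 to horizon (ASSUMED metric)."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in km_curve(times, events):
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (horizon - prev_t)


def min_pairwise_separation(groups, horizon):
    """Smallest absolute difference in restricted mean between any two groups."""
    means = [restricted_mean(t, e, horizon) for t, e in groups]
    return min(abs(a - b) for a, b in combinations(means, 2))


# Toy subtypes as (times in days, event flags); not real patient data.
groups = [([10, 20], [1, 1]), ([20, 30], [1, 1]), ([30, 30], [0, 0])]
print(min_pairwise_separation(groups, horizon=30))
```

Taking the minimum over all pairs means the score is driven by the two most similar subtypes, so a clustering only scores well if every pair of subtypes is separated.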

CONCLUSION

The results suggest that weighting genes by median absolute deviation and using a robust clustering methodology help achieve greater separation among the subtypes, with better statistical and clinical significance.
