Department of Information Systems and Analytics, School of Computing, National University of Singapore, 117418 Singapore.
TCS Innovation Labs, Kolkata 700156, India.
Bioinformatics. 2020 Jan 15;36(2):621-628. doi: 10.1093/bioinformatics/btz599.
The identification of sub-populations of patients with similar characteristics, called patient subtyping, is important for realizing the goals of precision medicine. Accurate subtyping is crucial for tailoring therapeutic strategies that can potentially lead to reduced mortality and morbidity. Model-based clustering, such as Gaussian mixture models, provides a principled and interpretable methodology that is widely used to identify subtypes. However, such models impose identical marginal distributions on each variable; this assumption restricts their modeling flexibility and degrades clustering performance.
In this paper, we use the statistical framework of copulas to decouple the modeling of marginals from the dependencies between them. Current copula-based methods cannot scale to high dimensions due to challenges in parameter inference. We develop HD-GMCM, which addresses these challenges and, to our knowledge, is the first copula-based clustering method that can fit high-dimensional data. Our experiments on real high-dimensional gene-expression and clinical datasets show that HD-GMCM outperforms state-of-the-art model-based clustering methods by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas. We present a case study on lung cancer data from TCGA. Clusters obtained from HD-GMCM can be interpreted based on the dependencies they model, which offers a new way of characterizing subtypes. Empirically, such modeling not only uncovers latent structure that leads to better clustering but also yields clinically meaningful subtypes in terms of patient survival rates.
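The decoupling of marginals from dependencies can be illustrated with a minimal sketch. It is not the HD-GMCM algorithm, only the rank-based transform commonly used in copula modeling, and it simplifies by using the standard normal quantile function rather than the marginals of a fitted mixture. The function names and the toy data are hypothetical and do not come from the authors' R implementation.

```python
from statistics import NormalDist


def pseudo_observations(values):
    """Map a sample to ranks scaled into (0, 1). Ties are not handled."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def normal_scores(values):
    """Push the pseudo-observations through the standard normal quantile function."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(u) for u in pseudo_observations(values)]


# Hypothetical toy data: one feature and a skewed, monotone transform of it.
x = [0.3, 2.1, 1.2, 5.0, 0.9]
x_skewed = [v ** 3 for v in x]  # changes the marginal, not the ordering

scores_x = normal_scores(x)
scores_skewed = normal_scores(x_skewed)
```

Because the transform depends only on ranks, `scores_x` and `scores_skewed` are identical even though the two marginal distributions differ. That invariance is what allows a dependence model fitted to the transformed data to ignore the shape of each marginal.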
An implementation of HD-GMCM in R is available at: https://bitbucket.org/cdal/hdgmcm/.
Supplementary data are available at Bioinformatics online.