Analytics and Data Science, Kennesaw State University, Kennesaw, USA.
Department of Computer Science, Kennesaw State University, Marietta, USA.
Methods. 2020 Feb 15;173:24-31. doi: 10.1016/j.ymeth.2019.06.017. Epub 2019 Jun 25.
Cancer is a genetic disease comprising multiple subtypes that have distinct molecular characteristics and clinical features. Cancer subtyping helps improve personalized treatment and decision-making, as different cancer subtypes respond differently to treatment. The increasing availability of cancer-related genomic data provides the opportunity to identify molecular subtypes. Several unsupervised machine learning techniques have been applied to molecular data from tumor samples to identify cancer subtypes that are genetically and clinically distinct. However, most clustering methods fail to cluster patients effectively because of the challenges posed by high-throughput genomic data and its nonlinearity. In this paper, we propose a pathway-based deep clustering method (PACL) for molecular subtyping of cancer, which incorporates gene expression data and a biological pathway database to group patients into cancer subtypes. The main contribution of our model is to discover high-level representations of biological data by learning complex hierarchical and nonlinear effects of pathways. We compared the performance of our model with a number of benchmark clustering methods recently proposed for cancer subtyping. Using logrank tests, we assessed the hypothesis that clusters (subtypes) are associated with different survival outcomes. PACL yielded the lowest logrank p-value among the benchmark methods. This indicates that the patient groups clustered by PACL may correspond to subtypes that are significantly associated with distinct survival distributions. Moreover, PACL provides a solution to comprehensively identify subtypes and interpret the model at the biological pathway level. The open-source software of PACL in PyTorch is publicly available at https://github.com/tmallava/PACL.
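The logrank evaluation described in the abstract can be illustrated with a minimal sketch. It assumes two clusters only and right-censored survival data, and it is not taken from the PACL repository: the function name `logrank_test`, its arguments, and the toy survival times are hypothetical and chosen purely for illustration. The p-value uses the chi-square distribution with one degree of freedom, computed from the standard library's `math.erfc`.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test (illustrative helper, not part of PACL).

    times_*  : follow-up time for each patient in the group
    events_* : 1 if the event (death) was observed, 0 if censored
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    # Pool all patients as (time, event, group); group 0 = A, group 1 = B.
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    # Only distinct times at which at least one event occurred contribute.
    event_times = sorted({t for t, e, _ in data if e})

    observed_minus_expected = 0.0  # summed over event times, for group A
    variance = 0.0

    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n = n_a + n_b
        d = d_a + d_b

        # Expected events in group A under the null hypothesis.
        observed_minus_expected += d_a - d * n_a / n
        # Hypergeometric variance; undefined (and zero) when n == 1.
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 0.0, 1.0

    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data (made up for illustration): survival in months per cluster.
cluster_1_times = [1, 2, 3, 4, 5]
cluster_1_events = [1, 1, 1, 1, 1]
cluster_2_times = [10, 11, 12, 13, 14]
cluster_2_events = [1, 1, 1, 1, 1]

chi2, p_value = logrank_test(
    cluster_1_times, cluster_1_events, cluster_2_times, cluster_2_events
)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

A smaller p-value suggests that the clusters differ more strongly in their survival distributions. With more than two clusters, a multi-group logrank test from an established survival analysis library would be the more appropriate choice.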