Wang Chao, Machiraju Raghu, Huang Kun
Department of Biomedical Informatics, The Ohio State University, United States; Department of Electrical and Computer Engineering, The Ohio State University, United States.
Department of Computer Science and Engineering, The Ohio State University, United States.
Methods. 2014 Jun 1;67(3):304-12. doi: 10.1016/j.ymeth.2014.03.005. Epub 2014 Mar 18.
Breast cancers are highly heterogeneous, with different subtypes that lead to different clinical outcomes, including prognosis, response to treatment, and chances of recurrence and metastasis. An important task in personalized medicine is to determine the subtype of a breast cancer patient in order to provide the most effective treatment. To this end, integrative genomics approaches have recently been developed that draw on multiple modalities of large datasets, ranging from genotypes to multiple levels of phenotypes. A major challenge in integrative genomics is how to effectively integrate multiple modalities of data to stratify breast cancer patients. Consensus clustering algorithms have often been adopted for this purpose. However, existing consensus clustering algorithms are not suited to integrating clustering results obtained from a mixture of numerical and categorical data. In this work, we present a mathematical formulation for integrative clustering of multiple-source data, including both numerical and categorical data, to resolve this issue. Specifically, we formulate the problem as a novel consensus clustering method called Molecular Regularized Consensus Patient Stratification (MRCPS), based on an optimization process with regularization. Unlike traditional consensus clustering methods, MRCPS can automatically cluster both numerical and categorical data with any choice of similarity metric. We apply this new method to the TCGA breast cancer datasets and evaluate it using both statistical criteria and clinical relevance in predicting prognosis. The results demonstrate the superiority of this method in terms of effectiveness of aggregation and differentiation of patient outcomes. Our method, while motivated by breast cancer research, is nevertheless generally applicable to integrative genomics studies.
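To make the idea of combining clusterings from numerical and categorical sources concrete, the following is a minimal sketch of a generic co-association (evidence accumulation) consensus, a common baseline for consensus clustering. It is not the MRCPS formulation, whose objective and regularization terms are not given in the abstract. The function names, the 0.5 threshold, and the toy label vectors are all assumptions made for illustration only.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of input clusterings in which each pair of samples shares a cluster.

    Each labeling is a sequence of cluster labels over the same samples.
    Labels may be integers (e.g. from clustering numerical data) or strings
    (e.g. categorical groupings); only equality within a labeling is used.
    """
    n = len(labelings[0])
    if any(len(labels) != n for labels in labelings):
        raise ValueError("all labelings must cover the same samples")
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in labelings if labels[i] == labels[j])
        matrix[i][j] = matrix[j][i] = agree / len(labelings)
    return matrix


def consensus_labels(matrix, threshold=0.5):
    """Consensus clusters as connected components of the graph that links
    every pair of samples whose co-association exceeds the threshold.
    The threshold is an illustrative choice, not a recommended value."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical toy input: three clusterings of six patients.
expression_clusters = [0, 0, 0, 1, 1, 1]               # from numerical data
methylation_clusters = [0, 0, 1, 1, 1, 1]              # from numerical data
mutation_groups = ["A", "A", "A", "B", "B", "B"]       # categorical grouping

matrix = co_association(
    [expression_clusters, methylation_clusters, mutation_groups]
)
print(consensus_labels(matrix))
```

In this toy input, the third patient is grouped with the first two by two of the three sources, so a majority-style threshold keeps them together. A regularized optimization such as the one the abstract describes would replace this fixed thresholding step.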