School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA.
Department of Statistics, Rice University, Houston, Texas, USA.
Biometrics. 2023 Dec;79(4):3846-3858. doi: 10.1111/biom.13860. Epub 2023 Apr 12.
Clustering has long been a popular unsupervised learning approach for identifying groups of similar objects and discovering patterns in unlabeled data across many applications. Yet meaningful interpretation of the estimated clusters has often been challenging, precisely because of their unsupervised nature. Meanwhile, in many real-world scenarios, there are noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both the supervising auxiliary variables and the unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and steers the analysis toward more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to a better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.