Buch Amanda M, Liston Conor, Grosenick Logan
Dept. of Psychiatry & BMRI, Weill Cornell Medicine, Cornell University.
Proc Mach Learn Res. 2024 May;238:136-144.
AI-enabled precision medicine promises a transformational improvement in healthcare outcomes. However, training on biomedical data presents significant challenges because such data are often high-dimensional, clustered, and of limited sample size. To overcome these challenges, we propose a simple and scalable approach for cluster-aware embedding that combines latent factor methods with a convex clustering penalty in a modular way. Our novel approach overcomes the complexity and limitations of current joint embedding and clustering methods and enables hierarchically clustered principal component analysis (PCA), locally linear embedding (LLE), and canonical correlation analysis (CCA). Through numerical experiments and real-world examples, we demonstrate that our approach outperforms fourteen clustering methods on highly underdetermined problems (e.g., with limited sample size) as well as on large-sample datasets. Importantly, our approach does not require the user to choose the desired number of clusters, yields improved model selection when the user does choose one, and produces interpretable hierarchically clustered embedding dendrograms. Thus, our approach improves significantly on existing methods for identifying patient subgroups in multiomics and neuroimaging data and enables scalable and interpretable biomarkers for precision medicine.
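To make the idea of pairing a latent factor method with a convex clustering penalty concrete, the following is a minimal sketch and not the authors' formulation. It assumes a sequential simplification (PCA first, penalty second) and the generic convex clustering objective with uniform pair weights, 0.5 * ||Z - U||_F^2 + gamma * sum_{i<j} ||u_i - u_j||_2. The function names `pca_embed` and `convex_clustering_objective`, the synthetic data, and the values of `gamma` are all hypothetical choices made for illustration.

```python
import numpy as np

def pca_embed(X, k):
    """Project the centered rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def convex_clustering_objective(Z, U, gamma):
    """Generic convex clustering objective with uniform pair weights (an assumption).

    Z: (n, k) embedded samples; U: (n, k) one centroid per sample.
    """
    fit = 0.5 * np.sum((Z - U) ** 2)
    # Pairwise Euclidean distances between centroids, each pair counted once.
    pair_norms = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
    penalty = np.triu(pair_norms, k=1).sum()
    return fit + gamma * penalty

# Synthetic underdetermined data: 20 samples, 50 features, two groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (10, 50)), rng.normal(3, 1, (10, 50))])
Z = pca_embed(X, k=2)

# Two extreme candidate solutions for the centroid matrix U.
U_identity = Z.copy()                                # every sample is its own cluster
U_collapsed = np.tile(Z.mean(axis=0), (len(Z), 1))   # all samples share one centroid

for gamma in (0.0, 0.1, 10.0):
    print(gamma,
          convex_clustering_objective(Z, U_identity, gamma),
          convex_clustering_objective(Z, U_collapsed, gamma))
```

With `gamma = 0` the per-sample solution has zero cost, while a large `gamma` favors the fully collapsed solution. Sweeping `gamma` between these extremes and tracking which centroids merge is what gives convex clustering a hierarchical path, so the number of clusters need not be fixed in advance.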