Lifting the curse from high-dimensional data: automated projection pursuit clustering for a variety of biological data modalities.

Author information

Simpson Claire, Tabatsky Evgeniy, Rahil Zainab, Eddins Devon J, Tkachev Sasha, Georgescauld Florian, Papalegis Derek, Culka Martin, Levy Tyler, Gregoretti Ivan, Meehan Connor, Schiller Chiara, Bestak Kresimir, Schapiro Denis, Chernyshev Andrei, Walther Guenther, Ghosn Eliver E B, Orlova Darya

Affiliations

Cell Signaling Technology, Danvers, MA 01915, USA.

Independent researcher, Komsomolsk-on-Amur 681021, Russia.

Publication information

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf052.

DOI:10.1093/gigascience/giaf052

PMID:40440093

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12121483/

Abstract

Unsupervised clustering is a powerful machine-learning technique widely used to analyze high-dimensional biological data. It plays a crucial role in uncovering patterns, structures, and inherent relationships within complex datasets without relying on predefined labels. In the context of biology, high-dimensional data may include transcriptomics, proteomics, and a variety of single-cell omics data. Most existing clustering algorithms operate directly in the high-dimensional space, and their performance may be negatively affected by the phenomenon known as the curse of dimensionality. Here, we show an alternative clustering approach that alleviates the curse by sequentially projecting high-dimensional data into a low-dimensional representation. We validated the effectiveness of our approach, named automated projection pursuit (APP), across various biological data modalities, including flow and mass cytometry data, scRNA-seq, multiplex imaging data, and T-cell receptor repertoire data. APP efficiently recapitulated experimentally validated cell-type definitions and revealed new biologically meaningful patterns.
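The central idea in the abstract, clustering by repeatedly projecting the data into a low-dimensional view instead of working in the full space, can be illustrated with a toy sketch. The code below is not the APP implementation from the paper. It is a minimal example under stated assumptions: candidate directions are limited to principal axes plus a few random unit vectors, a 1D histogram "valley depth" stands in for a projection index, and the names `candidate_directions`, `best_split`, `pp_cluster` and all thresholds are hypothetical choices made for illustration.

```python
import numpy as np


def candidate_directions(X, rng, n_random=20):
    """Principal axes of X followed by a few random unit vectors (assumed candidate set)."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    random_dirs = rng.normal(size=(n_random, X.shape[1]))
    random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
    return np.vstack([vt, random_dirs])


def best_split(X, rng, min_size, bins=20):
    """Return (score, direction, threshold) for the deepest 1D density valley.

    The score is 1 - valley_count / smaller_neighbouring_peak, so 1.0 means
    an empty bin sits between two populated modes. Score 0.0 means no valley.
    """
    best = (0.0, None, None)
    for w in candidate_directions(X, rng):
        z = X @ w  # project every point onto the candidate direction
        hist, edges = np.histogram(z, bins=bins)
        for i in range(1, bins - 1):
            # Both sides of the candidate cut must keep enough points.
            if hist[:i].sum() < min_size or hist[i + 1:].sum() < min_size:
                continue
            peak = min(hist[:i].max(), hist[i + 1:].max())
            score = 1.0 - hist[i] / peak
            if score > best[0]:
                best = (score, w, 0.5 * (edges[i] + edges[i + 1]))
    return best


def pp_cluster(X, min_size=50, min_score=0.9, seed=0):
    """Split the data along low-dimensional projections until no clear valley remains."""
    rng = np.random.default_rng(seed)
    labels = np.full(len(X), -1)
    pending = [np.arange(len(X))]  # index sets still to be examined
    next_label = 0
    while pending:
        idx = pending.pop()
        if len(idx) >= 2 * min_size:
            score, w, threshold = best_split(X[idx], rng, min_size)
            if score >= min_score:
                left = X[idx] @ w < threshold
                pending += [idx[left], idx[~left]]
                continue
        # No convincing split: this subset becomes one cluster.
        labels[idx] = next_label
        next_label += 1
    return labels


# Synthetic example: two well-separated Gaussian blobs in 10 dimensions.
data_rng = np.random.default_rng(1)
blob_a = data_rng.normal(0.0, 1.0, size=(500, 10))
blob_b = data_rng.normal(8.0, 1.0, size=(500, 10))
labels = pp_cluster(np.vstack([blob_a, blob_b]))
print("clusters found:", np.unique(labels).size)
```

Every split decision here is made on a one-dimensional projection, so density is only ever estimated in a low-dimensional space, which is the property the abstract credits with alleviating the curse of dimensionality. A realistic projection pursuit method would optimize the projection index rather than scan a fixed candidate set.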
