Wadhwa Raoul R, Williamson Drew F K, Dhawan Andrew, Scott Jacob G
Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA.
Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA.
J Open Source Softw. 2018 Aug;3(28). doi: 10.21105/joss.00860. Epub 2018 Aug 8.
High-dimensional datasets are becoming more common in a variety of scientific fields. Well-known examples include next-generation sequencing in biology, patient health status in medicine, and computer vision in deep learning. Dimension reduction, using methods like principal component analysis (PCA), is a common preprocessing step for such datasets. However, while dimension reduction can save computing and human resources, it comes at the cost of significant information loss. Topological data analysis (TDA) aims to analyze the "shape" of high-dimensional datasets, without dimension reduction, by extracting features that are robust to small perturbations in the data. Persistent features of a dataset can be used to describe it and to compare it to other datasets. Persistent features can be visualized using topological barcodes or persistence diagrams (Figure 1). Application of TDA methods has granted greater insight into high-dimensional data (Lakshmikanth et al., 2017); one prominent example is their use to characterize a clinically relevant subgroup of breast cancer patients (Nicolau, Levine, & Carlsson, 2011). This study is particularly salient because Nicolau et al. (2011) used a topological method, termed Progression Analysis of Disease, to identify a patient subgroup with 100% survival that remains invisible to other clustering methods.