Cukierski William J, Foran David J
Dept. of Biomedical Engineering, Rutgers University and the University of Medicine and Dentistry of New Jersey.
Proc IEEE Int Conf Data Min. 2008 Dec 1;2008:949-958. doi: 10.1109/ICDMW.2008.39.
High-dimensional data presents a challenge to tasks of pattern recognition and machine learning. Dimensionality reduction (DR) methods remove the unwanted variance and make these tasks tractable. Several nonlinear DR methods, such as the well known ISOMAP algorithm, rely on a neighborhood graph to compute geodesic distances between data points. These graphs can contain unwanted edges which connect disparate regions of one or more manifolds. This topological sensitivity is well known [1], [2], [3], yet handling high-dimensional, noisy data in the absence of a priori manifold knowledge, remains an open and difficult problem. This work introduces a divisive, edge-removal method based on graph betweenness centrality which can robustly identify manifold-shorting edges. The problem of graph construction in high dimension is discussed and the proposed algorithm is fit into the ISOMAP workflow. ROC analysis is performed and the performance is tested on synthetic and real datasets.
高维数据给模式识别和机器学习任务带来了挑战。降维(DR)方法消除了不必要的方差,使这些任务变得易于处理。几种非线性DR方法,如著名的ISOMAP算法,依赖邻域图来计算数据点之间的测地距离。这些图可能包含连接一个或多个流形不同区域的不必要边。这种拓扑敏感性是众所周知的[1]、[2]、[3],然而,在没有先验流形知识的情况下处理高维、有噪声的数据仍然是一个开放且困难的问题。这项工作引入了一种基于图中介中心性的分裂式边去除方法,该方法可以稳健地识别流形短路边。讨论了高维图构建问题,并将所提出的算法应用于ISOMAP工作流程中。进行了ROC分析,并在合成数据集和真实数据集上测试了性能。