Mathematical Institute, University of Oxford, Oxford OX2 6GG, United Kingdom.
The Alan Turing Institute, British Library, London NW1 2DB, United Kingdom.
Proc Natl Acad Sci U S A. 2020 Aug 18;117(33):19664-19669. doi: 10.1073/pnas.2001741117. Epub 2020 Aug 3.
The quest for low-dimensional models which approximate high-dimensional data is pervasive across the physical, natural, and social sciences. The dominant paradigm underlying most standard modeling techniques assumes that the data are concentrated near a single unknown manifold of relatively small intrinsic dimension. Here, we present a systematic framework for detecting interfaces and related anomalies in data which may fail to satisfy the manifold hypothesis. By computing the local topology of small regions around each data point, we are able to partition a given dataset into disjoint classes, each of which can be individually approximated by a single manifold. Since these manifolds may have different intrinsic dimensions, local topology discovers singular regions in data even when none of the points have been sampled precisely from the singularities. We showcase this method by identifying the intersection of two surfaces in the 24-dimensional space of cyclo-octane conformations and by locating all of the self-intersections of a Henneberg minimal surface immersed in 3-dimensional space. Due to the local nature of the topological computations, the algorithmic burden of performing such data stratification is readily distributable across several processors.
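The idea of classifying points by the topology of a small region around them can be illustrated with a toy sketch. The abstract does not spell out the computation, so what follows is not the authors' method. It swaps in a much cruder proxy: for each point, count the connected components of the data lying in a small annulus around it. The dataset (two line segments crossing at the origin), the function names, and the parameters `r_in`, `r_out`, and `eps` are all assumptions chosen for illustration.

```python
import numpy as np


def count_components(points, eps):
    """Count connected components when points closer than eps are linked."""
    n = len(points)
    if n == 0:
        return 0
    diffs = points[:, None, :] - points[None, :, :]
    adjacent = np.linalg.norm(diffs, axis=2) < eps
    seen = np.zeros(n, dtype=bool)
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first search over the eps-neighbourhood graph.
        components += 1
        seen[start] = True
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adjacent[i] & ~seen):
                seen[j] = True
                stack.append(j)
    return components


def annulus_component_counts(data, r_in, r_out, eps):
    """For each point, count components of the data inside its annulus."""
    counts = np.empty(len(data), dtype=int)
    for k, p in enumerate(data):
        dist = np.linalg.norm(data - p, axis=1)
        annulus = data[(dist > r_in) & (dist < r_out)]
        counts[k] = count_components(annulus, eps)
    return counts


# Hypothetical toy data: the segments y = x and y = -x, crossing at the origin.
t = np.linspace(-1.0, 1.0, 201)
data = np.vstack([np.column_stack([t, t]), np.column_stack([t, -t])])

# Assumed scales: the annulus radii and linking distance are hand-picked.
counts = annulus_component_counts(data, r_in=0.1, r_out=0.2, eps=0.05)

# Points whose annulus splits into four pieces sit near the crossing.
singular = data[counts == 4]
```

In this toy example, the annulus around a point near the crossing splits into four components, the annulus around a point in the interior of either segment splits into two, and the annulus around a segment endpoint has a single component. Because each count depends only on a point's own neighbourhood, the loop in `annulus_component_counts` could be split across processors, in line with the abstract's remark about distributing the computation.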