Facco Elena, d'Errico Maria, Rodriguez Alex, Laio Alessandro
SISSA, International School for Advanced Studies, Department of Molecular and Statistical Biophysics, Trieste, 34136, Italy.
Sci Rep. 2017 Sep 22;7(1):12140. doi: 10.1038/s41598-017-11873-y.
Analyzing large volumes of high-dimensional data is a problem of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the raw number of coordinates. Such a manifold is generally twisted and curved, and the points on it are non-uniformly distributed; these two factors make identifying and exploiting the ID very hard. Here we propose a new ID estimator that uses only the distances to the first and second nearest neighbors of each point in the sample. This minimal use of neighborhood information reduces the effects of curvature and of density variation, as well as the computational cost. The ID estimator is theoretically exact in uniformly distributed datasets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This makes it possible to estimate the ID even when the data lie on a manifold perturbed by high-dimensional noise, a situation often encountered in real-world datasets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.
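The abstract does not spell out the estimator's formula, so the following is only a minimal sketch of how a two-neighbor ID estimate and a block analysis might look. It assumes that, for locally uniform density, the ratio mu = r2/r1 of the second to the first nearest-neighbor distance follows a Pareto law F(mu) = 1 - mu^(-d), and it uses the maximum-likelihood form d ≈ N / sum(ln mu_i), which may differ from the fitting procedure used in the paper. The function names `two_nn_id` and `id_by_block_size`, the brute-force distance computation, and the synthetic test data are illustrative assumptions, not part of the original work.

```python
import numpy as np


def two_nn_id(points):
    """Estimate the intrinsic dimension from 1st and 2nd neighbor distances.

    Assumption: mu = r2 / r1 is Pareto-distributed with exponent d, so the
    maximum-likelihood estimate is d = N / sum(log(mu)).
    """
    x = np.asarray(points, dtype=float)
    # Brute-force squared pairwise distances; adequate for a few thousand points.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)     # exclude each point from its own neighbors
    # Columns 0 and 1 hold the smallest and second-smallest distance per row.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                    # drop exact duplicates (ratio undefined)
    mu = r2[keep] / r1[keep]
    return mu.size / np.sum(np.log(mu))


def id_by_block_size(points, n_blocks_list=(1, 2, 4, 8), seed=0):
    """Average the ID estimate over random blocks of decreasing size.

    Returns a dict mapping block size to the mean estimate over the blocks.
    """
    x = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    results = {}
    for n_blocks in n_blocks_list:
        order = rng.permutation(len(x))
        blocks = np.array_split(order, n_blocks)
        estimates = [two_nn_id(x[idx]) for idx in blocks]
        results[len(x) // n_blocks] = float(np.mean(estimates))
    return results


# Synthetic check: a uniform 2D square embedded isometrically in 5D.
rng = np.random.default_rng(0)
flat = rng.random((2000, 2))
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # 5x2, orthonormal columns
sample = flat @ basis.T
estimate = two_nn_id(sample)
by_block = id_by_block_size(sample)
```

On this synthetic sample, both `estimate` and every value in `by_block` should come out close to 2, the dimension of the embedded square, even though the data have five coordinates. For data perturbed by high-dimensional noise, one would expect the estimate to change with block size, which is the behavior the block analysis is meant to expose.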