Estimating the intrinsic dimension of datasets by a minimal neighborhood information.

Authors

Facco Elena, d'Errico Maria, Rodriguez Alex, Laio Alessandro

Affiliation

SISSA International School for Advanced Studies, Department of Molecular and Statistical Biophysics, Trieste, 34136, Italy.

Publication information

Sci Rep. 2017 Sep 22;7(1):12140. doi: 10.1038/s41598-017-11873-y.

DOI:10.1038/s41598-017-11873-y

PMID:28939866

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5610237/

Abstract

Analyzing large volumes of high-dimensional data is an issue of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number of coordinates. Such manifold is generally twisted and curved; in addition points on it will be non-uniformly distributed: two factors that make the identification of the ID and its exploitation really hard. Here we propose a new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature, of density variation, and the resulting computational cost. The ID estimator is theoretically exact in uniformly distributed datasets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by a high-dimensional noise, a situation often encountered in real world data sets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.
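The estimator described in the abstract can be sketched in a few lines. The abstract itself gives no formulas, so the sketch below rests on stated assumptions: for locally uniform density in d dimensions, the ratio μ = r2/r1 of the second to the first nearest-neighbor distance follows the Pareto law F(μ) = 1 − μ^(−d), so a straight line through the origin fitted to −log(1 − F) against log μ has slope d. The function names `two_nn_id` and `block_analysis`, the brute-force distance computation, the 10% discard fraction and the random partition into blocks are illustrative choices, not the authors' reference implementation.

```python
import numpy as np


def two_nn_id(points, discard_fraction=0.1):
    """Estimate the intrinsic dimension from the ratio of the second to the
    first nearest-neighbor distance of every point (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    # Brute-force pairwise distances: O(N^2 * D) memory, fine for small N.
    # For large datasets a spatial index (e.g. a KD-tree) would be preferable.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    # Columns 0 and 1 after partitioning hold the two smallest distances.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    valid = r1 > 0  # duplicated points have an undefined ratio
    mu = np.sort(r2[valid] / r1[valid])
    n = mu.size
    # Empirical cumulative distribution of the sorted ratios.
    f_emp = np.arange(1, n + 1) / n
    # Drop the largest ratios (assumed fraction); always drop the last one,
    # where 1 - F = 0 and the logarithm diverges.
    keep = min(int(n * (1.0 - discard_fraction)), n - 1)
    x = np.log(mu[:keep])
    y = -np.log(1.0 - f_emp[:keep])
    # Least-squares slope of a line constrained to pass through the origin.
    return float(np.dot(x, y) / np.dot(x, x))


def block_analysis(points, n_blocks_list=(1, 2, 4), seed=0):
    """Average the estimate over random blocks of decreasing size.
    Returns a dict mapping block size to mean estimated dimension."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    results = {}
    for n_blocks in n_blocks_list:
        order = rng.permutation(len(points))
        blocks = np.array_split(order, n_blocks)
        estimates = [two_nn_id(points[block]) for block in blocks]
        results[len(points) // n_blocks] = float(np.mean(estimates))
    return results
```

Calling `two_nn_id(data)` on an array of shape (N, D) returns a single estimate, while `block_analysis(data)` shows how the estimate changes as the blocks shrink. In the noisy-manifold setting the abstract mentions, the estimate at small block sizes is the one expected to be less affected by small-scale noise.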

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e24/5610237/b0986c9f43a8/41598_2017_11873_Fig1_HTML.jpg
