
Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data.

Authors

Khan Shiraj, Bandyopadhyay Sharba, Ganguly Auroop R, Saigal Sunil, Erickson David J, Protopopescu Vladimir, Ostrouchov George

Affiliations

Computational Sciences and Engineering, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA.

Publication Information

Phys Rev E Stat Nonlin Soft Matter Phys. 2007 Aug;76(2 Pt 2):026209. doi: 10.1103/PhysRevE.76.026209. Epub 2007 Aug 14.

Abstract

Commonly used dependence measures, such as linear correlation, the cross-correlogram, or Kendall's tau, cannot capture the complete dependence structure in data unless the structure is restricted to linear, periodic, or monotonic forms. Mutual information (MI) has been frequently utilized for capturing the complete dependence structure, including nonlinear dependence. Recently, several methods have been proposed for MI estimation, such as kernel density estimators (KDEs), k-nearest neighbors (KNNs), the Edgeworth approximation of differential entropy, and adaptive partitioning of the XY plane. However, outstanding gaps in the current literature have precluded the ability to effectively automate these methods, which in turn has limited their adoption by application communities. This study attempts to address a key gap in the literature: specifically, the evaluation of the above methods to choose the best one, particularly in terms of robustness for short and noisy data, based on comparisons with theoretical MI values, which can be computed analytically, as well as with linear correlation and Kendall's tau. Here we consider smaller data sizes, such as 50, 100, and 1000; within this study we characterize 50 and 100 data points as very short and 1000 as short. We consider a broad class of functions, specifically linear, quadratic, periodic, and chaotic, contaminated with artificial noise at varying noise-to-signal ratios. Our results indicate that KDEs are the best choice for very short data at relatively high noise-to-signal levels, whereas KNNs perform best for very short data at relatively low noise levels as well as for short data consistently across noise levels. In addition, the optimal smoothing parameter of a Gaussian kernel appears to be the best choice for KDEs, while three nearest neighbors appear optimal for KNNs. Thus, in situations where the approximate data size is known in advance and exploratory data analysis and/or domain knowledge can provide a priori insight into the noise-to-signal ratio, these results point to a way forward for automating the process of MI estimation.
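The two estimator families the abstract compares can be sketched compactly. The following is a minimal illustration, not the authors' exact implementation: the KNN estimator follows the standard Kraskov–Stögbauer–Grassberger (KSG) construction, and the KDE estimator is a plug-in resubstitution estimate using SciPy's default Scott-rule bandwidth (the function names `ksg_mi` and `kde_mi` and the bandwidth choice are assumptions of this sketch, not taken from the paper).

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma
from scipy.stats import gaussian_kde

def ksg_mi(x, y, k=3):
    """KSG k-nearest-neighbor MI estimate (illustrative sketch).

    I(X;Y) ~= psi(k) + psi(n) - <psi(n_x + 1) + psi(n_y + 1)>
    with neighbor counts taken within the Chebyshev distance to the
    k-th neighbor in the joint (x, y) space.
    """
    x = np.asarray(x, float).reshape(-1, 1)
    y = np.asarray(y, float).reshape(-1, 1)
    n = len(x)
    xy = np.hstack([x, y])
    # Distance to the k-th neighbor in the joint space (k+1 because the
    # query point itself is returned at distance 0).
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    # Count marginal neighbors strictly inside eps (self excluded).
    nx = np.array([len(tree_x.query_ball_point(p, e - 1e-12, p=np.inf)) - 1
                   for p, e in zip(x, eps)])
    ny = np.array([len(tree_y.query_ball_point(p, e - 1e-12, p=np.inf)) - 1
                   for p, e in zip(y, eps)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

def kde_mi(x, y):
    """Plug-in KDE MI estimate: mean log of p(x,y) / (p(x) p(y))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    pxy = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    return np.mean(np.log(pxy(np.vstack([x, y])) / (px(x) * py(y))))
```

The default `k=3` mirrors the paper's finding that three nearest neighbors appear optimal for KNNs; for a bivariate Gaussian with correlation rho, both estimates can be checked against the analytical value MI = -0.5 ln(1 - rho^2).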

