Noshad Morteza, Choi Jerome, Sun Yuming, Hero Alfred, Dinov Ivo D
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA.
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305 USA.
J Big Data. 2021;8(1):82. doi: 10.1186/s40537-021-00446-6. Epub 2021 Jun 5.
Data-driven innovation is propelled by recent scientific advances, rapid technological progress, substantial reductions of manufacturing costs, and significant demand for effective decision support systems. This has led to efforts to collect massive amounts of heterogeneous and multisource data; however, not all data are of equal quality or equally informative. Previous methods to capture and quantify the utility of data include value of information (VoI), quality of information (QoI), and mutual information (MI). This manuscript introduces a new measure to quantify whether larger volumes of increasingly complex data enhance, degrade, or alter their information content and utility with respect to specific tasks. We present a new information-theoretic measure, called the Data Value Metric (DVM), that quantifies the useful information content (energy) of large and heterogeneous datasets. The DVM formulation is based on a regularized model balancing data analytical value (utility) and model complexity. DVM can be used to determine whether appending, expanding, or augmenting a dataset may be beneficial in specific application domains. Subject to the choice of data analytic, inferential, or forecasting techniques employed to interrogate the data, DVM quantifies the information boost, or degradation, associated with increasing the data size or expanding the richness of its features. DVM is defined as a mixture of a fidelity term and a regularization term. The fidelity term captures the usefulness of the sample data specifically in the context of the inferential task. The regularization term represents the computational complexity of the corresponding inferential method. Inspired by the concept of the information bottleneck in deep learning, the fidelity term depends on the performance of the corresponding supervised or unsupervised model.
We tested the DVM method on several alternative supervised and unsupervised regression, classification, clustering, and dimensionality reduction tasks. Both real and simulated datasets with weak and strong signal information are used in the experimental validation. Our findings suggest that DVM effectively captures the balance between analytical value and algorithmic complexity. Changes in the DVM expose the tradeoffs between algorithmic complexity and data analytical value in terms of the sample size and the feature richness of a dataset. DVM values may be used to determine the size and characteristics of the data that optimize the relative utility of various supervised or unsupervised algorithms.
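The abstract describes the DVM as a mixture of a fidelity term (task utility) and a regularization term (computational complexity). The paper's exact functional form is not given here, so the sketch below is only a hypothetical illustration of that fidelity-versus-complexity tradeoff: the function name `dvm`, the linear mixture, and the weight `lam` are assumptions, not the authors' formula.

```python
def dvm(fidelity: float, complexity: float, lam: float = 0.5) -> float:
    """Hypothetical DVM-style score: task fidelity (e.g. model accuracy)
    penalized by a regularization term proportional to the computational
    complexity of the inferential method. Illustrative only; the paper's
    actual DVM formulation may differ."""
    return fidelity - lam * complexity

# Illustration: growing a dataset may raise fidelity (better fit)
# while also raising complexity (more expensive inference), so the
# combined score can degrade even as raw performance improves.
score_small = dvm(fidelity=0.82, complexity=0.30)  # smaller dataset
score_large = dvm(fidelity=0.85, complexity=0.60)  # larger dataset
print(score_small, score_large)
```

Under these made-up numbers the smaller dataset scores higher, mirroring the abstract's point that appending or augmenting data is not always beneficial once algorithmic complexity is accounted for.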