A data value metric for quantifying information content and utility.

Authors

Noshad Morteza, Choi Jerome, Sun Yuming, Hero Alfred, Dinov Ivo D

Affiliations

Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA.

Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305 USA.

Publication information

J Big Data. 2021;8(1):82. doi: 10.1186/s40537-021-00446-6. Epub 2021 Jun 5.

DOI:10.1186/s40537-021-00446-6
PMID:34777945
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8550565/
Abstract

Data-driven innovation is propelled by recent scientific advances, rapid technological progress, substantial reductions of manufacturing costs, and significant demands for effective decision support systems. This has led to efforts to collect massive amounts of heterogeneous and multisource data; however, not all data is of equal quality or equally informative. Previous methods to capture and quantify the utility of data include value of information (VoI), quality of information (QoI), and mutual information (MI). This manuscript introduces a new measure to quantify whether larger volumes of increasingly complex data enhance, degrade, or alter their information content and utility with respect to specific tasks. We present a new information-theoretic measure, called Data Value Metric (DVM), that quantifies the useful information content (energy) of large and heterogeneous datasets. The DVM formulation is based on a regularized model balancing data analytical value (utility) and model complexity. DVM can be used to determine if appending, expanding, or augmenting a dataset may be beneficial in specific application domains. Subject to the choices of data analytic, inferential, or forecasting techniques employed to interrogate the data, DVM quantifies the information boost, or degradation, associated with increasing the data size or expanding the richness of its features. DVM is defined as a mixture of a fidelity term and a regularization term. The fidelity term captures the usefulness of the sample data specifically in the context of the inferential task. The regularization term represents the computational complexity of the corresponding inferential method. Inspired by the concept of information bottleneck in deep learning, the fidelity term depends on the performance of the corresponding supervised or unsupervised model.
We tested the DVM method on several alternative supervised and unsupervised regression, classification, clustering, and dimensionality reduction tasks. Both real and simulated datasets with weak and strong signal information are used in the experimental validation. Our findings suggest that DVM effectively captures the balance between analytical value and algorithmic complexity. Changes in the DVM expose the tradeoffs between algorithmic complexity and data analytical value in terms of the sample size and the feature richness of a dataset. DVM values may be used to determine the size and characteristics of the data to optimize the relative utility of various supervised or unsupervised algorithms.
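
The abstract describes DVM only in words, as a fidelity term balanced against a complexity-based regularization term. The following is a minimal sketch of that general idea and not the authors' formulation: the additive form `fidelity - lam * complexity`, the weight `lam`, the use of hold-out accuracy as the fidelity, the use of relative training-set size as a complexity proxy, and all function names (`dvm_score`, `make_data`, `centroid_accuracy`, `demo`) are assumptions made purely for illustration.

```python
import random

def dvm_score(fidelity, complexity, lam=0.1):
    """Toy DVM-style score (assumed form): fidelity minus a weighted
    complexity penalty. Both inputs are assumed to lie in [0, 1]."""
    return fidelity - lam * complexity

def make_data(n, rng, shift=2.0):
    """Simulate n labeled 2-D points from two Gaussian classes.
    `shift` (hypothetical parameter) sets how far apart the classes are."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        point = (rng.gauss(label * shift, 1.0), rng.gauss(label * shift, 1.0))
        data.append((point, label))
    return data

def centroid_accuracy(train, test):
    """Fit a nearest-centroid classifier on `train` and return its
    accuracy on `test`; this stands in for the fidelity term."""
    sums = {}
    for (x0, x1), y in train:
        sx, sy, cnt = sums.get(y, (0.0, 0.0, 0))
        sums[y] = (sx + x0, sy + x1, cnt + 1)
    centroids = {y: (sx / cnt, sy / cnt) for y, (sx, sy, cnt) in sums.items()}
    correct = 0
    for (x0, x1), y in test:
        pred = min(
            centroids,
            key=lambda c: (x0 - centroids[c][0]) ** 2 + (x1 - centroids[c][1]) ** 2,
        )
        correct += pred == y
    return correct / len(test)

def demo(sizes=(20, 80, 320), lam=0.1, seed=0):
    """Score increasing training-set sizes against one fixed test set."""
    rng = random.Random(seed)
    test = make_data(500, rng)
    rows = []
    for n in sizes:
        fidelity = centroid_accuracy(make_data(n, rng), test)
        complexity = n / max(sizes)  # crude proxy: relative training cost
        rows.append((n, fidelity, dvm_score(fidelity, complexity, lam)))
    return rows

if __name__ == "__main__":
    for n, fid, dvm in demo():
        print(f"n={n:4d}  fidelity={fid:.3f}  dvm={dvm:.3f}")
```

In this toy setup, adding samples raises the score only while the gain in hold-out accuracy outweighs the growing penalty, which mirrors the tradeoff the abstract attributes to DVM. The paper's actual fidelity and regularization terms are defined in the full text linked above.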

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58b6/8550565/b4d5ffc36244/40537_2021_446_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58b6/8550565/3b0182859d7a/40537_2021_446_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58b6/8550565/61668d4a127f/40537_2021_446_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58b6/8550565/22b838c62a7e/40537_2021_446_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58b6/8550565/c7d3566178fb/40537_2021_446_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58b6/8550565/49d7be0ae273/40537_2021_446_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58b6/8550565/3d3af8154b93/40537_2021_446_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58b6/8550565/302e29560b25/40537_2021_446_Fig8_HTML.jpg
