Relating instance hardness to classification performance in a dataset: a visual approach.

Authors

Paiva Pedro Yuri Arbs, Moreno Camila Castro, Smith-Miles Kate, Valeriano Maria Gabriela, Lorena Ana Carolina

Affiliations

Instituto Tecnológico de Aeronáutica (ITA), São José dos Campos, São Paulo, Brazil.

Universidade Federal de São Paulo (Unifesp), São José dos Campos, São Paulo, Brazil.

Publication information

Mach Learn. 2022;111(8):3085-3123. doi: 10.1007/s10994-022-06205-9. Epub 2022 Jun 22.

DOI:10.1007/s10994-022-06205-9
PMID:35761958
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9217125/
Abstract

Machine Learning studies often involve a series of computational experiments in which the predictive performance of multiple models is compared across one or more datasets. The results obtained are usually summarized through average statistics, either in numeric tables or simple plots. Such approaches fail to reveal interesting subtleties about algorithmic performance, including which observations an algorithm may find easy or hard to classify, and which observations within a dataset may present unique challenges. Recently, a methodology known as Instance Space Analysis (ISA) was proposed for visualizing algorithm performance across different datasets. This methodology relates predictive performance to estimated instance hardness measures extracted from the datasets. However, the analysis considered an instance to be an entire classification dataset, and algorithm performance was reported for each dataset as an average error across all observations in the dataset. In this paper, we develop a more fine-grained analysis by adapting the ISA methodology. The adapted version of ISA allows the analysis of an individual classification dataset through a 2-D hardness embedding, which provides a visualization of the data according to the difficulty level of its individual observations. This allows deeper analyses of the relationships between instance hardness and the predictive performance of classifiers. We also provide an open-access Python package named PyHard, which encapsulates the adapted ISA and provides an interactive visualization interface. We illustrate through case studies how our tool can provide insights about data quality and algorithm performance in the presence of challenges such as noisy and biased data.
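
To make the idea of hardness at the level of individual observations concrete, the following minimal sketch computes one measure from the instance hardness literature, k-Disagreeing Neighbors (kDN): the fraction of an observation's k nearest neighbors that carry a different class label. This is an illustration only. The abstract does not state which hardness measures the adapted ISA uses, the function name `k_disagreeing_neighbors` and the toy data are assumptions made up for this example, and nothing here reflects the PyHard API.

```python
import math


def k_disagreeing_neighbors(X, y, k=3):
    """Per-observation hardness: share of the k nearest neighbours
    (Euclidean distance) whose class label differs from the observation's.

    Hypothetical helper for illustration; not part of PyHard.
    """
    if not 0 < k < len(X):
        raise ValueError("k must be between 1 and len(X) - 1")
    scores = []
    for i, xi in enumerate(X):
        # Distances from observation i to every other observation.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbours if y[j] != y[i])
        scores.append(disagree / k)
    return scores


# Toy data (invented): two well-separated clusters, plus one observation
# labelled 1 (index 3) that sits inside the class-0 cluster, mimicking
# a noisy label.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
     (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.15, 5.15)]
y = [0, 0, 0, 1, 1, 1, 1, 1]

hardness = k_disagreeing_neighbors(X, y, k=3)
hardest = max(range(len(X)), key=hardness.__getitem__)
print(hardness, hardest)
```

An average error over this toy dataset would hide the fact that a single observation accounts for most of the difficulty, whereas the per-observation scores single it out. That contrast is the motivation the abstract gives for analyzing hardness within one dataset rather than across datasets.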

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/d82b63a228d1/10994_2022_6205_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/885336eb4d13/10994_2022_6205_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/9831f1a59fd8/10994_2022_6205_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/edad28bcc4bf/10994_2022_6205_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/f9ff6984e153/10994_2022_6205_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/dbbdcb767081/10994_2022_6205_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/c8e6d78fc1b4/10994_2022_6205_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/59a132a31670/10994_2022_6205_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/43b4762b9100/10994_2022_6205_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/3c5333956f9b/10994_2022_6205_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/b370fc774240/10994_2022_6205_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/aae1c7d215b3/10994_2022_6205_Fig12_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/08647f101c91/10994_2022_6205_Fig13_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/387d8d15f1c1/10994_2022_6205_Fig14_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32a9/9217125/56a39e32dd65/10994_2022_6205_Fig15_HTML.jpg
