DREAMER: a computational framework to evaluate readiness of datasets for machine learning.

Affiliations

Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Department of Neurology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Publication information

BMC Med Inform Decis Mak. 2024 Jun 4;24(1):152. doi: 10.1186/s12911-024-02544-w.

DOI:10.1186/s12911-024-02544-w

PMID:38831432

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11149315/

Abstract

BACKGROUND

Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.

RESULTS

The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies.
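
To make the idea of removing extraneous rows and features concrete, the following is a minimal sketch and not DREAMER's actual algorithm, which the abstract does not specify. It assumes a toy numeric table, uses leave-one-out 1-nearest-neighbour accuracy as a stand-in supervised quality signal, and prunes features greedily. All function names and the data are hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def drop_duplicate_rows(rows, labels):
    """Keep the first occurrence of each (features, label) pair."""
    seen, out_rows, out_labels = set(), [], []
    for r, y in zip(rows, labels):
        key = (tuple(r), y)
        if key not in seen:
            seen.add(key)
            out_rows.append(r)
            out_labels.append(y)
    return out_rows, out_labels


def loo_1nn_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    hits = 0
    for i, r in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: dist(r, rows[j]),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(rows)


def select(rows, cols):
    """Project every row onto the given column indices."""
    return [[r[j] for j in cols] for r in rows]


def greedy_feature_pruning(rows, labels):
    """Repeatedly drop the one feature whose removal most improves accuracy."""
    cols = list(range(len(rows[0])))
    best = loo_1nn_accuracy(rows, labels)
    improved = True
    while improved and len(cols) > 1:
        improved = False
        best_cols = cols
        for c in cols:
            trial = [x for x in cols if x != c]
            score = loo_1nn_accuracy(select(rows, trial), labels)
            if score > best:
                best, best_cols, improved = score, trial, True
        cols = best_cols
    return cols, best


# Hypothetical toy table: column 0 separates the classes, column 1 is
# large-scale noise, and the last row duplicates the first.
raw_rows = [
    [0.0, 5.0], [0.1, -4.0], [0.2, 3.0],
    [1.0, 4.9], [1.1, -4.1], [1.2, 3.1],
    [0.0, 5.0],
]
raw_labels = [0, 0, 0, 1, 1, 1, 0]

rows, labels = drop_duplicate_rows(raw_rows, raw_labels)
baseline = loo_1nn_accuracy(rows, labels)
kept_cols, pruned = greedy_feature_pruning(rows, labels)
print("rows kept:", len(rows))
print("accuracy before pruning:", baseline)
print("columns kept:", kept_cols, "accuracy after pruning:", pruned)
```

In this toy case the noisy column dominates the distance computation, so discarding it lets the informative column drive the nearest-neighbour decision. A real readiness assessment would also need an unsupervised signal and a search strategy that scales beyond a handful of columns.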

CONCLUSION

Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/32aa/11149315/2fb3e333af99/12911_2024_2544_Fig1_HTML.jpg
