Exploring the Interplay of Dataset Size and Imbalance on CNN Performance in Healthcare: Using X-rays to Identify COVID-19 Patients.

Authors

Davidian Moshe, Lahav Adi, Joshua Ben-Zion, Wand Ori, Lurie Yotam, Mark Shlomo

Affiliations

Guilford Glazer Faculty of Business and Management, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel.

Software Engineering Department, SCE-Shamoon College of Engineering, Beer-Sheva 84100, Israel.

Publication information

Diagnostics (Basel). 2024 Aug 8;14(16):1727. doi: 10.3390/diagnostics14161727.

DOI:10.3390/diagnostics14161727

PMID:39202215

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11353409/

Abstract

INTRODUCTION

The performance of convolutional neural network (CNN) systems in healthcare is affected by class imbalance and by the size of the available datasets. This article examines the impact of dataset size, class imbalance, and their interplay on CNN systems, focusing on the trade-off between training-set size and imbalance, a perspective that differs from the prevailing literature. It also addresses scenarios with more than two classes, which are often overlooked but common in practical settings.

METHODS

Initially, a CNN was developed to classify lung diseases using X-ray images, distinguishing between healthy individuals and COVID-19 patients. Later, the model was expanded to include pneumonia patients. To evaluate performance, numerous experiments were conducted with varied data sizes and imbalance ratios for both binary and ternary classifications, measuring various indices to validate the model's efficacy.
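The experimental design described above depends on drawing training subsets with a chosen total size and a chosen class ratio. The abstract does not describe how the authors built these subsets, so the following is only a minimal sketch of one plausible way to do it. The function name `make_subset`, the label strings, and the pool sizes are hypothetical.

```python
import random
from collections import Counter


def make_subset(labels, total, minority_fraction, minority_label, seed=0):
    """Pick indices for a subset of `total` samples in which
    round(total * minority_fraction) samples carry `minority_label`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]

    n_min = round(total * minority_fraction)
    n_maj = total - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough samples for the requested size and ratio")

    # Sample without replacement from each group, then mix the order.
    chosen = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    rng.shuffle(chosen)
    return chosen


# Assumed label pool: 1000 healthy and 300 COVID-19 images.
labels = ["healthy"] * 1000 + ["covid"] * 300

# Two subsets with the same number of COVID-19 samples but different
# total size and balance.
balanced = make_subset(labels, total=200, minority_fraction=0.5, minority_label="covid")
skewed = make_subset(labels, total=500, minority_fraction=0.2, minority_label="covid")

print(Counter(labels[i] for i in balanced))
print(Counter(labels[i] for i in skewed))
```

Sweeping `total` and `minority_fraction` over a grid and training one model per cell would give a size-versus-imbalance comparison of the kind the Methods paragraph describes.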

RESULTS

The study revealed that increasing dataset size positively impacts CNN performance, but this improvement saturates beyond a certain size. A novel finding is that the data balance ratio influences performance more significantly than dataset size. The behavior of three-class classification mirrored that of binary classification, underscoring the importance of balanced datasets for accurate classification.

CONCLUSIONS

This study shows that balanced class representation in a dataset is crucial for optimal CNN performance in healthcare, challenging the conventional focus on dataset size. Balanced datasets improve classification accuracy in both two-class and three-class scenarios, highlighting the need for data-balancing techniques to improve model reliability and effectiveness.
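The conclusions call for data-balancing techniques without naming one. As an illustration only, the sketch below shows random oversampling, one common balancing technique, applied to a three-class label list. The function name `oversample_indices` and the class counts are assumptions, and this is not presented as the method used in the study.

```python
import random
from collections import Counter


def oversample_indices(labels, seed=0):
    """Return indices in which every class appears as often as the
    largest class, by resampling smaller classes with replacement.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    target = max(len(idx) for idx in by_class.values())
    out = []
    for idx in by_class.values():
        out.extend(idx)                                   # keep every original sample
        out.extend(rng.choices(idx, k=target - len(idx)))  # add duplicates up to target
    rng.shuffle(out)
    return out


# Assumed three-class training labels with unequal class sizes.
train_labels = ["healthy"] * 400 + ["covid"] * 100 + ["pneumonia"] * 50
resampled = oversample_indices(train_labels)
print(Counter(train_labels[i] for i in resampled))
```

Oversampling is normally applied to the training split only, so that duplicated images do not leak into validation or test data.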

MOTIVATION

Our study is motivated by a scenario in which 100 patient samples are available and there are two options: a balanced dataset of 200 samples, or an unbalanced dataset of 500 samples, 400 of which come from healthy individuals. We aim to provide insight into which option is preferable, based on the interplay between dataset size and imbalance, for stakeholders seeking the best model performance.

LIMITATIONS

Because results obtained with a single model may not generalize, further studies on diverse datasets are needed.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a65f/11353409/7f85c1b0e85e/diagnostics-14-01727-g001.jpg
