Davidian Moshe, Lahav Adi, Joshua Ben-Zion, Wand Ori, Lurie Yotam, Mark Shlomo
Guilford Glazer Faculty of Business and Management, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel.
Software Engineering Department, SCE-Shamoon College of Engineering, Beer-Sheva 84100, Israel.
Diagnostics (Basel). 2024 Aug 8;14(16):1727. doi: 10.3390/diagnostics14161727.
The performance of Convolutional Neural Network (CNN) systems in healthcare is affected by both class imbalance and the size of the training dataset. This article examines the impact of dataset size, class imbalance, and their interplay on CNN systems, focusing on the trade-off between training-set size and imbalance, a perspective that differs from the prevailing literature. It also addresses scenarios with more than two classification groups, which are often overlooked but common in practical settings.
Initially, a CNN was developed to classify lung diseases using X-ray images, distinguishing between healthy individuals and COVID-19 patients. Later, the model was expanded to include pneumonia patients. To evaluate performance, numerous experiments were conducted with varied data sizes and imbalance ratios for both binary and ternary classifications, measuring various indices to validate the model's efficacy.
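The experimental design described above, in which training sets vary in both total size and class ratio, can be illustrated with a small sampling helper. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `subsample`, the class labels, and the fractions are hypothetical, and the abstract does not say how the subsets were actually drawn.

```python
import random
from collections import Counter


def subsample(labels, total_size, class_fractions, seed=0):
    """Return shuffled indices of a subset with a chosen size and class mix.

    labels          -- sequence of class labels, one per available image
    total_size      -- desired number of samples in the subset
    class_fractions -- mapping from label to its share of the subset (sums to 1)
    seed            -- fixed seed so a given configuration is reproducible
    """
    rng = random.Random(seed)

    # Group the available sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    chosen = []
    for label, fraction in class_fractions.items():
        wanted = round(total_size * fraction)
        pool = by_class.get(label, [])
        if wanted > len(pool):
            raise ValueError(f"not enough '{label}' samples: need {wanted}, have {len(pool)}")
        # Draw without replacement from this class.
        chosen.extend(rng.sample(pool, wanted))

    rng.shuffle(chosen)
    return chosen


# Hypothetical pool of 1000 images per class (binary case).
pool_labels = ["healthy"] * 1000 + ["covid"] * 1000

# One configuration from a size-by-ratio grid: 500 samples at an 80/20 split.
indices = subsample(pool_labels, 500, {"healthy": 0.8, "covid": 0.2})
composition = Counter(pool_labels[i] for i in indices)
print(composition["healthy"], composition["covid"])
```

The same helper covers the three-class case by adding a third label to `class_fractions`; looping over a grid of sizes and fractions would yield one training set per experimental configuration.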
The study revealed that increasing dataset size positively impacts CNN performance, but this improvement saturates beyond a certain size. A novel finding is that the data balance ratio influences performance more significantly than dataset size. The behavior of three-class classification mirrored that of binary classification, underscoring the importance of balanced datasets for accurate classification.
This study emphasizes that balanced class representation in datasets is crucial for optimal CNN performance in healthcare, challenging the conventional focus on dataset size. Balanced datasets improve classification accuracy in both two-class and three-class scenarios, highlighting the need for data-balancing techniques to improve model reliability and effectiveness.
Our study is motivated by a scenario in which 100 patient samples are available and two training sets can be built from them: a balanced dataset of 200 samples (100 patients and 100 healthy individuals) or an unbalanced dataset of 500 samples (100 patients and 400 healthy individuals). We aim to provide insight into which option is preferable, based on the interplay between dataset size and imbalance, for stakeholders seeking optimal model performance.
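The arithmetic of the motivating scenario can be made explicit with a short calculation. This is only an illustrative sketch: the helper `describe` is hypothetical, and the split of the balanced option into 100 patients and 100 healthy individuals is inferred from the stated totals rather than quoted from the article.

```python
def describe(counts):
    """Summarize a dataset by total size, majority/minority ratio, and minority share."""
    total = sum(counts.values())
    minority = min(counts.values())
    majority = max(counts.values())
    return {
        "total": total,
        "imbalance_ratio": majority / minority,  # 1.0 means perfectly balanced
        "minority_share": minority / total,
    }


# Option A: balanced, 200 samples (inferred as 100 patients + 100 healthy).
balanced = describe({"patient": 100, "healthy": 100})

# Option B: unbalanced, 500 samples (100 patients + 400 healthy).
unbalanced = describe({"patient": 100, "healthy": 400})

print(balanced)
print(unbalanced)
```

Option B has 2.5 times as many samples but a 4:1 majority-to-minority ratio, so the choice between the two options isolates the size-versus-balance trade-off that the study investigates.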
Because results from a single model may not generalize, further studies on diverse datasets are needed.