Bhat Sameer, Szczuko Piotr
Faculty of Electronics, Telecommunications and Informatics, Multimedia Systems Department, Gdansk University of Technology, Narutowicza 11/12, 80-233, Gdansk, Poland.
Sci Rep. 2025 May 12;15(1):16413. doi: 10.1038/s41598-025-98356-7.
This study investigates the classification of individuals as healthy or at risk of Parkinson's disease using machine learning (ML) models, focusing on the impact of dataset size and preprocessing techniques on model performance. Four datasets are created from an original dataset: D1 (the normal dataset), D2 (D1 subjected to Canny edge detection and Hessian filtering), D3 (augmented D1), and D4 (augmented D2). We evaluate a range of ML models on these datasets: Logistic Regression (LR), Decision Tree (DT), k-Nearest Neighbors (KNN), Random Forest (RF), Gradient Boosting (GB), XGBoost (XGB), Naive Bayes (NB), Support Vector Machine (SVM), and AdaBoost (AdB), analyzing prediction accuracy, model size, and prediction latency. The results show that while larger datasets lead to increased model memory footprints and prediction latencies, the Canny edge detection preprocessing supplemented by Hessian filtering (used in D2 and D4) degrades the performance of most models. In our experiments, Random Forest maintains a stable memory footprint of 61 KB across all datasets, while models such as KNN and SVM show significant increases in memory usage, from 5.7-7 KB on the original dataset to 102-220 KB on the augmented datasets, with similar increases in prediction time. Logistic Regression, Decision Tree, and Naive Bayes show stable memory footprints and fast prediction times across all datasets. XGBoost's prediction time increases from 180-200 ms on the original dataset to 700-3000 ms on the augmented datasets. Statistical analysis using the Mann-Whitney U test with 100 prediction accuracy observations per model (98 degrees of freedom) reveals significant differences in performance between models trained on the original and the Canny-preprocessed datasets (p-values < 1e-34 for most models), while effect sizes estimated with Cliff's delta (values approaching ±1) indicate large shifts in performance, especially for SVM and XGBoost. These findings highlight the importance of selecting lightweight models such as LR and DT for deployment in resource-constrained healthcare applications, as models such as KNN, SVM, and XGBoost show significant increases in resource demands with larger datasets, particularly when Canny preprocessing is applied.
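The abstract does not state the preprocessing parameters used to derive D2 and D4. As a minimal sketch of a Canny-plus-Hessian pipeline under assumed defaults (grayscale input, scikit-image's standard parameters, and a simple channel-stacking fusion that is our assumption, not the paper's method):

```python
# Sketch of Canny edge detection followed by Hessian (ridge) filtering.
# Assumes 2-D grayscale images and scikit-image defaults; the authors'
# actual sigma/threshold choices and fusion step are not given in the abstract.
import numpy as np
from skimage import io, feature, filters

def preprocess(path: str) -> np.ndarray:
    image = io.imread(path, as_gray=True)      # load as 2-D grayscale array
    edges = feature.canny(image, sigma=1.0)    # binary Canny edge map
    ridges = filters.hessian(image)            # Hessian-based ridge response
    # Hypothetical fusion: stack edge and ridge responses as two channels.
    return np.stack([edges.astype(float), ridges], axis=-1)
```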
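The three reported metrics (prediction accuracy, model size, prediction latency) can be measured in outline as follows; this is an illustrative setup on synthetic data, not the authors' benchmark, and the serialized-pickle size is one assumed way of defining "memory footprint":

```python
# Illustrative measurement of accuracy, serialized model size, and batch
# prediction latency on synthetic data; not the paper's actual pipeline.
import pickle, time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("LR", LogisticRegression(max_iter=1000)),
                    ("RF", RandomForestClassifier(n_estimators=100)),
                    ("SVM", SVC())]:
    model.fit(X_tr, y_tr)
    size_kb = len(pickle.dumps(model)) / 1024        # serialized footprint
    t0 = time.perf_counter()
    acc = model.score(X_te, y_te)                    # prediction accuracy
    latency_ms = (time.perf_counter() - t0) * 1000   # batch prediction time
    print(f"{name}: acc={acc:.3f} size={size_kb:.1f} KB "
          f"latency={latency_ms:.1f} ms")
```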
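The statistical comparison described above can be reproduced in outline with scipy: Cliff's delta follows directly from the Mann-Whitney U statistic via delta = 2U/(n1*n2) - 1, ranging from -1 to +1. The accuracy arrays below are synthetic placeholders standing in for the 100 per-model accuracy observations:

```python
# Outline of the statistical test: Mann-Whitney U plus Cliff's delta
# derived from the U statistic. Accuracy samples here are placeholders.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
acc_d1 = rng.normal(0.90, 0.01, 100)  # accuracies on the original dataset
acc_d2 = rng.normal(0.78, 0.02, 100)  # accuracies on the preprocessed one

u, p = mannwhitneyu(acc_d1, acc_d2, alternative="two-sided")
# Cliff's delta from U: delta = 2U/(n1*n2) - 1; |delta| near 1 = large effect.
delta = 2.0 * u / (len(acc_d1) * len(acc_d2)) - 1.0
print(f"U={u:.0f}  p={p:.2e}  Cliff's delta={delta:+.3f}")
```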