Barukab Omar, Ahmad Amir, Khan Tabrej, Thayyil Kunhumuhammed Mujeeb Rahiman
Department of Information Technology, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah 21589, Saudi Arabia.
College of Information Technology, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates.
Diagnostics (Basel). 2022 Nov 30;12(12):3000. doi: 10.3390/diagnostics12123000.
Parkinson's disease (PD) currently affects approximately 10 million people worldwide. Detecting PD-positive subjects is vital for disease prognosis, diagnosis, management, and treatment. Early symptoms such as speech impairment and changes in handwriting are associated with PD. To classify potential PD patients, many researchers have applied machine learning algorithms to various datasets related to the disease. In this work, we study a PD vocal impairment feature dataset, which is imbalanced. We present a comparative performance evaluation of various decision tree ensemble methods, with and without oversampling techniques. In addition, we compare classifier performance across different ensemble sizes and across various minority-to-majority class ratios obtained with oversampling and undersampling. Finally, we combine feature selection with the best-performing ensemble classifiers. The results show that AdaBoost, random forest, and RUSBoost, a decision tree ensemble developed for imbalanced datasets, perform well on precision, recall, F1-score, area under the receiver operating characteristic curve (AUROC), and geometric mean. Further, two feature selection methods, lasso and information gain, were used to select the 10 best features for the best ensemble classifiers. AdaBoost with information gain feature selection is the best-performing combination, with an F1-score of 0.903.
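The workflow summarized in the abstract (select the 10 best features, train a decision tree ensemble, score with F1 and geometric mean) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic and generated with scikit-learn's `make_classification`, the class ratio and hyperparameters are arbitrary, and scikit-learn's `mutual_info_classif` is used as a stand-in estimate of information gain. The helper name `score_mi` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for an imbalanced tabular dataset (NOT the PD voice data).
# The 75/25 class ratio is an arbitrary assumption for illustration.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=8,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def score_mi(X, y):
    """Hypothetical helper: mutual information with a fixed seed,
    used here as an estimate of information gain per feature."""
    return mutual_info_classif(X, y, random_state=0)


# Keep the 10 highest-scoring features, then fit AdaBoost
# (its default base learner is a depth-1 decision tree).
model = make_pipeline(
    SelectKBest(score_func=score_mi, k=10),
    AdaBoostClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# F1-score for the positive class, and the geometric mean of
# sensitivity (recall on class 1) and specificity (recall on class 0).
f1 = f1_score(y_test, y_pred)
sensitivity = recall_score(y_test, y_pred, pos_label=1)
specificity = recall_score(y_test, y_pred, pos_label=0)
g_mean = float(np.sqrt(sensitivity * specificity))

n_selected = int(model.named_steps["selectkbest"].get_support().sum())
print(f"features kept: {n_selected}, F1: {f1:.3f}, G-mean: {g_mean:.3f}")
```

Placing the feature selector inside the pipeline means it is fitted on the training split only, which avoids leaking test information into the selection step. The scores printed by this sketch depend on the synthetic data and are not comparable to the 0.903 F1-score reported above.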