Department of Computer Science, Faculty of Science and Technology, Thammasat University (Rangsit Campus), Pathum Thani, Thailand.
Thammasat University Research Unit in Data Innovation and Artificial Intelligence, Thammasat University (Rangsit Campus), Pathum Thani, Thailand.
PLoS One. 2024 Aug 29;19(8):e0305492. doi: 10.1371/journal.pone.0305492. eCollection 2024.
Existing missing-value imputation methods focus on recovering the actual values so that the completed dataset can serve as input for machine learning tasks. This work instead proposes imputing missing values with the aim of improving classification accuracy. The proposed method is based on the bees algorithm and uses k-nearest neighbors with linear regression to guide the search toward an appropriate solution and to limit randomness. Within this process, the Gini importance score is used to select the values for imputation. The imputed values are therefore chosen to improve discriminative power in classification tasks rather than to replicate the actual values of the original dataset. We evaluated the proposed method against frequently used imputation methods, namely k-nearest neighbors, principal component analysis, and nonlinear principal component analysis, comparing both the root mean square error and the accuracy obtained when the imputed datasets were used in a classification task. The experimental results indicated that the proposed method obtained the best accuracy on all datasets compared with the other methods. Compared with the original dataset, classification models built from the imputed datasets yielded 15-25% higher accuracy in class prediction. Further analysis showed that the feature ranking used in the classification process was affected, leading to a noticeable change in informativeness, because the data imputed by the proposed method boosted discriminative power.
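The k-nearest neighbors baseline and the root mean square error comparison mentioned above can be illustrated with a minimal sketch. This is not the authors' bees-algorithm method; it only shows a plain k-NN imputation and how its RMSE against known values might be computed. The function names `knn_impute` and `rmse_on_missing`, the use of `None` to mark missing entries, and the toy data are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is the root mean squared difference over features that both
    rows have observed. Hypothetical helper, not the paper's method.
    """
    n_features = len(rows[0])
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j in range(n_features):
            if row[j] is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour must be a different row with feature j observed.
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(n_features)
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            if nearest:  # entry stays None if no usable neighbour exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


def rmse_on_missing(truth, masked, imputed):
    """RMSE between true and imputed values, over originally missing cells only."""
    sq_errors = [
        (truth[i][j] - imputed[i][j]) ** 2
        for i in range(len(truth))
        for j in range(len(truth[0]))
        if masked[i][j] is None
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Toy data (assumed): the second feature is masked in the first and last rows.
truth = [[1.0, 2.0], [1.1, 2.2], [5.0, 10.0], [5.2, 10.4]]
masked = [[1.0, None], [1.1, 2.2], [5.0, 10.0], [5.2, None]]

filled = knn_impute(masked, k=1)
print(filled)
print(rmse_on_missing(truth, masked, filled))
```

A low RMSE here only measures how closely the imputed values replicate the originals. The abstract's point is that this is a different objective from classification accuracy, so in practice both metrics would be reported side by side.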