Batra Shivani, Khurana Rohan, Khan Mohammad Zubair, Boulila Wadii, Koubaa Anis, Srivastava Prakash
Department of Computer Science and Engineering, KIET Group of Institutions, Delhi-NCR, Ghaziabad 201206, India.
Department of Computer Science and Information, Taibah University, Medina 42353, Saudi Arabia.
Entropy (Basel). 2022 Apr 10;24(4):533. doi: 10.3390/e24040533.
Efficient computer modelling for medical decision-making requires pristine and trustworthy data, yet data in medical care are frequently missing. Consequently, missing values may occur not only in training data but also in testing data, which might contain a single undiagnosed episode or participant. This study evaluates different imputation and regression procedures, selected on the basis of regressor performance and computational expense, to address missing values in both training and testing datasets. Several procedures have been introduced for dealing with missing values in healthcare; however, there is still debate about which imputation strategies work better in specific cases. This research proposes an ensemble imputation model that is trained to use a combination of simple mean imputation, k-nearest neighbour imputation, and iterative imputation, and that selects the ideal strategy among them based on the attribute correlations of the features with missing values. We introduce a unique Ensemble Strategy for Missing Value to analyse healthcare data with considerable missing values and to obtain unbiased and accurate predictive statistical models. Performance metrics were generated using the eXtreme gradient boosting regressor, random forest regressor, and support vector regressor. Experiments on real-world healthcare data, together with simulations of data with varying feature-wise missing frequencies, indicate that the proposed technique surpasses standard missing value imputation approaches, as well as the approach of dropping records that hold missing values, in terms of accuracy.
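The abstract describes choosing among mean, k-nearest neighbour, and iterative imputation according to attribute correlations, but it does not state the selection rule. The sketch below is therefore only one possible reading, assuming scikit-learn and NumPy are available. The function names `max_abs_corr` and `ensemble_impute`, the thresholds `low` and `high`, and the rule itself (weakly correlated features get the mean, moderately correlated ones get k-NN, strongly correlated ones get iterative imputation) are hypothetical and are not taken from the paper. It also assumes every column has at least one observed value.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (needed to import IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer


def max_abs_corr(X, j):
    """Largest absolute Pearson correlation between column j and any other column,
    computed on the rows where both columns are observed."""
    best = 0.0
    for k in range(X.shape[1]):
        if k == j:
            continue
        both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
        if both.sum() < 3:
            continue  # too few shared observations to estimate a correlation
        a, b = X[both, j], X[both, k]
        if a.std() == 0 or b.std() == 0:
            continue  # correlation is undefined for a constant column
        best = max(best, abs(np.corrcoef(a, b)[0, 1]))
    return best


def ensemble_impute(X, low=0.3, high=0.6, n_neighbors=5, random_state=0):
    """Fill NaNs column by column, picking one of three imputers per column.

    Hypothetical rule (an assumption, not the paper's): use the mean when the
    column's strongest correlation is below `low`, k-NN when it lies between
    `low` and `high`, and iterative imputation otherwise.
    """
    X = np.asarray(X, dtype=float)
    # Run each candidate imputer once on the full matrix.
    candidates = {
        "mean": SimpleImputer(strategy="mean").fit_transform(X),
        "knn": KNNImputer(n_neighbors=n_neighbors).fit_transform(X),
        "iterative": IterativeImputer(random_state=random_state).fit_transform(X),
    }
    filled = X.copy()
    chosen = {}
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if not missing.any():
            continue  # nothing to impute in this column
        strength = max_abs_corr(X, j)
        if strength < low:
            name = "mean"
        elif strength < high:
            name = "knn"
        else:
            name = "iterative"
        # Copy only the imputed cells; observed values stay untouched.
        filled[missing, j] = candidates[name][missing, j]
        chosen[j] = name
    return filled, chosen


# Toy data: column 1 depends strongly on column 0, column 2 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
X = np.column_stack([x0, 2 * x0 + 0.1 * rng.normal(size=200), rng.normal(size=200)])
X[rng.random(200) < 0.2, 1] = np.nan
X[rng.random(200) < 0.2, 2] = np.nan

filled, chosen = ensemble_impute(X)
print(chosen)  # maps each imputed column index to the strategy that was used
```

The imputed matrix could then be passed to any of the regressors named in the abstract. Running all three imputers up front keeps the sketch short but costs more computation than fitting only the selected one per feature.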